百分之99的人都不知道关于pandas的stack的冷知识

最新推荐文章于 2024-06-18 11:38:04 发布

soul booster

最新推荐文章于 2024-06-18 11:38:04 发布

阅读量428

点赞数

分类专栏：工作学习文章标签： python pandas

本文链接：https://blog.csdn.net/qq_41967784/article/details/118192923

版权

学习同时被 2 个专栏收录

37 篇文章 0 订阅

订阅专栏

工作

6 篇文章 0 订阅

订阅专栏

先上stack的定义

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html

    def stack(self, level=-1, dropna=True):
        """
        Stack the prescribed level(s) from columns to index.
        Return a reshaped DataFrame or Series having a multi-level
        index with one or more new inner-most levels compared to the current
        DataFrame. The new inner-most levels are created by pivoting the
        columns of the current dataframe:
          - if the columns have a single level, the output is a Series;
          - if the columns have multiple levels, the new index
            level(s) is (are) taken from the prescribed level(s) and
            the output is a DataFrame.
        Parameters
        ----------
        level : int, str, list, default -1
            Level(s) to stack from the column axis onto the index
            axis, defined as one index or label, or a list of indices
            or labels.
        dropna : bool, default True
            Whether to drop rows in the resulting Frame/Series with
            missing values. Stacking a column level onto the index
            axis can create combinations of index and column values
            that are missing from the original dataframe. See Examples
            section.
        Returns
        -------
        DataFrame or Series
            Stacked dataframe or series.
        """

stack 主要是把列转成索引，返回一个变形后的 dataframe （原列是多层的）或 series （原列的单层的）。

参数只有 level 和 dropna 两个。

level 就是需要转的层级，可以传入int类型，层级的最上层是0，所以默认值-1就是指最下面那层；还可以传入str，需要设置columns.names；还可以传入list，就是转多层，注意顺序。

dropna 是决定是否丢弃全是missing values的行，什么意思呢，看下面例子：

In [10]: df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
    ...:                                     index=['cat', 'dog'],
    ...:                                     columns=multicol2)

In [11]: df_multi_level_cols3
Out[11]:
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0

In [12]: df_multi_level_cols3.stack(dropna=False)
Out[12]:
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

In [13]: df_multi_level_cols3.stack(dropna=True)
Out[13]:
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

细心的朋友应该能看出区别吧，没错，这就是这篇文章的重点了。

dropna参数的妙用

你现在有这样一个数据集：

In [22]: df = pd.DataFrame({'sex':['male','female','female'],
							'type':['student','teacher','student'],
							'age':[7,28,8]})

In [23]: df
Out[23]:
      sex     type  age
0    male  student    7
1  female  teacher   28
2  female  student    8

你需要根据性别和职业分组求年龄的均值，很简单对吧：

In [24]: df1 = df.groupby(['sex','type'])[['age']].mean()

In [25]: df1
Out[25]:
                age
sex    type
female student    8
       teacher   28
male   student    7

诶，这时候你发现，因为这个数据集没有male的teacher，产生的缺失值导致你这个表少了一行，但是你老板就偏偏要你做成这种样式的excel表格：

excel

咋办呢，你会想到用reindex对索引进行排序：

In [34]: df1.reindex(index=['student','teacher'], level=1)
Out[34]:
                age
sex    type
female student    8
       teacher   28
male   student    7

可惜，reindex对多层和单层的作用好像不太相同，结果并不理想。

你又想到了直接构建multiIndex来排序：

In [35]: multicol1 = pd.MultiIndex.from_product([('male', 'female'),
    ...:                                     ('student', 'teacher')])

In [36]: multicol1
Out[36]:
MultiIndex([(  'male', 'student'),
            (  'male', 'teacher'),
            ('female', 'student'),
            ('female', 'teacher')],
           )

In [37]: df1.reindex(index=multicol1)
Out[37]:
                 age
male   student   7.0
       teacher   NaN
female student   8.0
       teacher  28.0

成功了，但是实际生产中的索引并不会这么简单，构建索引的时候可能会踩到别的坑。

那么，接下来我们来用 stack 实现：

In [38]: df1.unstack().stack(dropna=False)
Out[38]:
                 age
sex    type
female student   8.0
       teacher  28.0
male   student   7.0
       teacher   NaN

OHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!

soul booster

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
百分之99的人都不知道关于pandas的stack的冷知识

先上stack的定义https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html def stack(self, level=-1, dropna=True): """ Stack the prescribed level(s) from columns to index. Return a reshaped DataFrame or
复制链接

扫一扫

专栏目录