先上stack的定义
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html
def stack(self, level=-1, dropna=True):
"""
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level
index with one or more new inner-most levels compared to the current
DataFrame. The new inner-most levels are created by pivoting the
columns of the current dataframe:
- if the columns have a single level, the output is a Series;
- if the columns have multiple levels, the new index
level(s) is (are) taken from the prescribed level(s) and
the output is a DataFrame.
Parameters
----------
level : int, str, list, default -1
Level(s) to stack from the column axis onto the index
axis, defined as one index or label, or a list of indices
or labels.
dropna : bool, default True
Whether to drop rows in the resulting Frame/Series with
missing values. Stacking a column level onto the index
axis can create combinations of index and column values
that are missing from the original dataframe. See Examples
section.
Returns
-------
DataFrame or Series
Stacked dataframe or series.
"""
stack
主要是把列转成索引,返回一个变形后的 dataframe
(原列是多层的)或 series
(原列的单层的)。
参数只有 level 和 dropna 两个。
level
就是需要转的层级,可以传入int类型,层级的最上层是0,所以默认值-1就是指最下面那层;还可以传入str,需要设置columns.names;还可以传入list,就是转多层,注意顺序。
dropna
是决定是否丢弃全是missing values的行,什么意思呢,看下面例子:
In [10]: df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
...: index=['cat', 'dog'],
...: columns=multicol2)
In [11]: df_multi_level_cols3
Out[11]:
weight height
kg m
cat NaN 1.0
dog 2.0 3.0
In [12]: df_multi_level_cols3.stack(dropna=False)
Out[12]:
height weight
cat kg NaN NaN
m 1.0 NaN
dog kg NaN 2.0
m 3.0 NaN
In [13]: df_multi_level_cols3.stack(dropna=True)
Out[13]:
height weight
cat m 1.0 NaN
dog kg NaN 2.0
m 3.0 NaN
细心的朋友应该能看出区别吧,没错,这就是这篇文章的重点了。
dropna参数的妙用
你现在有这样一个数据集:
In [22]: df = pd.DataFrame({'sex':['male','female','female'],
'type':['student','teacher','student'],
'age':[7,28,8]})
In [23]: df
Out[23]:
sex type age
0 male student 7
1 female teacher 28
2 female student 8
你需要根据性别和职业分组求年龄的均值,很简单对吧:
In [24]: df1 = df.groupby(['sex','type'])[['age']].mean()
In [25]: df1
Out[25]:
age
sex type
female student 8
teacher 28
male student 7
诶,这时候你发现,因为这个数据集没有male的teacher,产生的缺失值导致你这个表少了一行,但是你老板就偏偏要你做成这种样式的excel表格:
咋办呢,你会想到用reindex对索引进行排序:
In [34]: df1.reindex(index=['student','teacher'], level=1)
Out[34]:
age
sex type
female student 8
teacher 28
male student 7
可惜,reindex对多层和单层的作用好像不太相同,结果并不理想。
你又想到了直接构建multiIndex来排序:
In [35]: multicol1 = pd.MultiIndex.from_product([('male', 'female'),
...: ('student', 'teacher')])
In [36]: multicol1
Out[36]:
MultiIndex([( 'male', 'student'),
( 'male', 'teacher'),
('female', 'student'),
('female', 'teacher')],
)
In [37]: df1.reindex(index=multicol1)
Out[37]:
age
male student 7.0
teacher NaN
female student 8.0
teacher 28.0
成功了,但是实际生产中的索引并不会这么简单,构建索引的时候可能会踩到别的坑。
那么,接下来我们来用 stack
实现:
In [38]: df1.unstack().stack(dropna=False)
Out[38]:
age
sex type
female student 8.0
teacher 28.0
male student 7.0
teacher NaN
OHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!