1 Reshaping data with pivot
Data is often stored in so-called "stacked" or "record" format. For example, the following data holds observations of different variables at different times:
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
4 2000-01-04 B 1.212112
5 2000-01-05 B -0.173215
6 2000-01-03 C 0.119209
7 2000-01-04 C -1.044236
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
We can select all of the data for variable A:
In [2]: df[df["variable"] == "A"]
Out[2]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
Suppose we want to perform time-series analysis per variable. This layout makes that awkward: it would be far easier to work with a DataFrame indexed by time, with one column per variable name, where each (index, column) pair holds the corresponding observed value. We can produce exactly that with the DataFrame.pivot() method or the top-level pd.pivot() function:
In [3]: df.pivot(index="date", columns="variable", values="value")
Out[3]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
If the values argument is omitted, and more than one column remains besides those used as the index and columns, the resulting DataFrame will have hierarchical columns. For example:
In [4]: df["value2"] = df["value"] * 2
In [5]: pivoted = df.pivot(index="date", columns="variable")
In [6]: pivoted
Out[6]:
value value2
variable A B C D A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429 -1.723698 2.143608
You can then select subsets from the resulting DataFrame:
In [7]: pivoted["value2"]
Out[7]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note: if the specified index/columns combination contains duplicate entries, pivot raises ValueError: Index contains duplicate entries, cannot reshape.
In that case, use its generalization pivot_table(), which supports duplicate values for the index or columns.
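The difference is easy to see on a small, hypothetical dataset with a duplicated (date, variable) pair: pivot refuses to reshape it, while pivot_table aggregates the duplicates (with the mean, by default). A minimal sketch:

```python
import pandas as pd

# Hypothetical data: two rows share the ("2000-01-03", "A") pair.
df = pd.DataFrame(
    {
        "date": ["2000-01-03", "2000-01-03", "2000-01-04"],
        "variable": ["A", "A", "A"],
        "value": [1.0, 3.0, 5.0],
    }
)

try:
    df.pivot(index="date", columns="variable", values="value")
    raised = False
except ValueError:
    # "Index contains duplicate entries, cannot reshape"
    raised = True

# pivot_table averages the duplicated rows instead: (1.0 + 3.0) / 2 == 2.0
table = df.pivot_table(index="date", columns="variable", values="value")
```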
2 Reshaping data with stack and unstack
Closely related to pivot() are the stack() and unstack() methods, available on both Series and DataFrame.
These methods are designed to work with MultiIndex objects:
- stack: moves a column level (for hierarchical column labels, a chosen level) into the innermost level of the row index
- unstack: the inverse of stack; moves the innermost level of the row index out into a column level
An example makes this clear:
In [8]: tuples = list(
...: zip(
...: *[
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...: )
...: )
...:
In [9]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
In [11]: df2 = df[:4]
In [12]: df2
Out[12]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
Calling stack compresses the DataFrame's columns into the row index, producing:
- a Series, if the columns are a plain Index
- a DataFrame, if the columns are a MultiIndex
If the columns are a MultiIndex, you can choose which level to stack:
In [13]: stacked = df2.stack()
In [14]: stacked
Out[14]:
first second
bar one A 0.721555
B -0.706771
two A -1.039575
B 0.271860
baz one A -0.424972
B 0.567020
two A 0.276232
B -1.087401
dtype: float64
We can reverse this with unstack, which by default unstacks the innermost level:
In [15]: stacked.unstack()
Out[15]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
We can also unstack a specific level:
In [16]: stacked.unstack(1)
Out[16]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
In [17]: stacked.unstack(0)
Out[17]:
first bar baz
second
one A 0.721555 -0.424972
B -0.706771 0.567020
two A -1.039575 0.276232
B 0.271860 -1.087401
Of course, if the index levels have names, you can pass a level name instead:
In [18]: stacked.unstack("second")
Out[18]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
Notice that the stack and unstack methods implicitly sort the index levels involved. Hence, a call to stack followed by unstack (or vice versa) returns a sorted copy of the original DataFrame or Series:
In [19]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])
In [21]: df
Out[21]:
A
2 a -0.370647
b -1.157892
1 a -1.344312
b 0.844885
In [22]: all(df.unstack().stack() == df.sort_index())
Out[22]: True
If the call to sort_index is removed, the code above raises a TypeError.
2.1 Multiple levels
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually, in order:
In [23]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat", "long"),
....: ("B", "cat", "long"),
....: ("A", "dog", "short"),
....: ("B", "dog", "short"),
....: ],
....: names=["exp", "animal", "hair_length"],
....: )
....:
In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
In [25]: df
Out[25]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 1.075770 -0.109050 1.643563 -1.469388
1 0.357021 -0.674600 -1.776904 -0.968914
2 -1.294524 0.413738 0.276662 -0.472035
3 -0.013960 -0.362543 -0.006154 -0.923061
In [26]: df.stack(level=["animal", "hair_length"])
Out[26]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
The list of levels can contain either level names or level numbers, but not a mixture of the two. The code above is equivalent to:
In [27]: df.stack(level=[1, 2])
Out[27]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
2.2 Missing data
These functions are intelligent about handling missing data: they do not expect each subgroup within the hierarchical index to have the same set of labels, and they can also handle unsorted indexes. For example:
In [28]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat"),
....: ("B", "dog"),
....: ("B", "cat"),
....: ("A", "dog"),
....: ],
....: names=["exp", "animal"],
....: )
....:
In [29]: index = pd.MultiIndex.from_product(
....: [("bar", "baz", "foo", "qux"), ("one", "two")], names=["first", "second"]
....: )
....:
In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
In [32]: df2
Out[32]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux two -1.226825 0.769804 -1.281247 -0.727707
Passing a level name to stack:
In [33]: df2.stack("exp")
Out[33]:
animal cat dog
first second exp
bar one A 0.895717 2.565646
B -1.206412 0.805244
two A 1.431256 -0.226169
B -1.170299 1.340309
baz one A 0.410835 -0.827317
B 0.132003 0.813850
foo one A -1.413681 0.569605
B 1.024180 1.607920
two A 0.875906 -2.006747
B 0.974466 -2.211372
qux two A -1.226825 -0.727707
B -1.281247 0.769804
In [34]: df2.stack("animal")
Out[34]:
exp A B
first second animal
bar one cat 0.895717 -1.206412
dog 2.565646 0.805244
two cat 1.431256 -1.170299
dog -0.226169 1.340309
baz one cat 0.410835 0.132003
dog -0.827317 0.813850
foo one cat -1.413681 1.024180
dog 0.569605 1.607920
two cat 0.875906 0.974466
dog -2.006747 -2.211372
qux two cat -1.226825 -1.281247
dog -0.727707 0.769804
If we extract a subset of the data and then unstack it, missing label combinations are filled with the appropriate missing-value marker by default, such as NaN or NaT:
In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
In [36]: df3
Out[36]:
exp B
animal dog cat
first second
bar one 0.805244 -1.206412
two 1.340309 -1.170299
foo one 1.607920 1.024180
qux two 0.769804 -1.281247
In [37]: df3.unstack()
Out[37]:
exp B
animal dog cat
second one two one two
first
bar 0.805244 1.340309 -1.206412 -1.170299
foo 1.607920 NaN 1.024180 NaN
qux NaN 0.769804 NaN -1.281247
unstack takes an optional fill_value argument, specifying the value to fill missing data with:
In [38]: df3.unstack(fill_value=-1e9)
Out[38]:
exp B
animal dog cat
second one two one two
first
bar 8.052440e-01 1.340309e+00 -1.206412e+00 -1.170299e+00
foo 1.607920e+00 -1.000000e+09 1.024180e+00 -1.000000e+09
qux -1.000000e+09 7.698036e-01 -1.000000e+09 -1.281247e+00
2.3 MultiIndex columns
unstack also works on a DataFrame with MultiIndex columns:
In [39]: df[:3].unstack(0)
Out[39]:
exp A B A
animal cat dog cat dog
first bar baz bar baz bar baz bar baz
second
one 0.895717 0.410835 0.805244 0.81385 -1.206412 0.132003 2.565646 -0.827317
two 1.431256 NaN 1.340309 NaN -1.170299 NaN -0.226169 NaN
In [40]: df2.unstack(1)
Out[40]:
exp A B A
animal cat dog cat dog
second one two one two one two one two
first
bar 0.895717 1.431256 0.805244 1.340309 -1.206412 -1.170299 2.565646 -0.226169
baz 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317 NaN
foo -1.413681 0.875906 1.607920 -2.211372 1.024180 0.974466 0.569605 -2.006747
qux NaN -1.226825 NaN 0.769804 NaN -1.281247 NaN -0.727707
3 Reshaping a DataFrame with melt
The top-level melt() function and the corresponding DataFrame.melt() method are another way to reshape a DataFrame.
They treat one or more columns as identifier variables and all of the remaining columns as measured variables; this is effectively the inverse of pivot, leaving only two non-identifier columns, variable and value.
The names of those two columns can be customized with var_name and value_name:
In [41]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: }
....: )
....:
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
In [43]: cheese.melt(id_vars=["first", "last"])
Out[43]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [44]: cheese.melt(id_vars=["first", "last"], var_name="quantity")
Out[44]:
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
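value_name, mentioned above but not demonstrated, renames the value column the same way var_name renames the variable column. A small sketch on a reduced version of the cheese data:

```python
import pandas as pd

# Rename both output columns: "variable" -> "quantity", "value" -> "reading".
cheese = pd.DataFrame(
    {"first": ["John", "Mary"], "last": ["Doe", "Bo"], "height": [5.5, 6.0]}
)
out = cheese.melt(id_vars=["first", "last"], var_name="quantity", value_name="reading")
```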
When transforming a DataFrame with melt(), the index is ignored. The original index values can be kept by setting the ignore_index argument to False (it defaults to True):
In [45]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
In [46]: cheese = pd.DataFrame(
....: {
....: "first": ["John", "Mary"],
....: "last": ["Doe", "Bo"],
....: "height": [5.5, 6.0],
....: "weight": [130, 150],
....: },
....: index=index,
....: )
....:
In [47]: cheese
Out[47]:
first last height weight
person A John Doe 5.5 130
B Mary Bo 6.0 150
In [48]: cheese.melt(id_vars=["first", "last"])
Out[48]:
first last variable value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0
In [49]: cheese.melt(id_vars=["first", "last"], ignore_index=False)
Out[49]:
first last variable value
person A John Doe height 5.5
B Mary Bo height 6.0
A John Doe weight 130.0
B Mary Bo weight 150.0
Another way to transform is to use the wide_to_long() function. It is less flexible than melt(), but also more user-friendly:
In [50]: dft = pd.DataFrame(
....: {
....: "A1970": {0: "a", 1: "b", 2: "c"},
....: "A1980": {0: "d", 1: "e", 2: "f"},
....: "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
....: "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
....: "X": dict(zip(range(3), np.random.randn(3))),
....: }
....: )
....:
In [51]: dft["id"] = dft.index
In [52]: dft
Out[52]:
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -0.121306 0
1 b e 1.2 1.3 -0.097883 1
2 c f 0.7 0.1 0.695775 2
In [53]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[53]:
X A B
id year
0 1970 -0.121306 a 2.5
1 1970 -0.097883 b 1.2
2 1970 0.695775 c 0.7
0 1980 -0.121306 d 3.2
1 1980 -0.097883 e 1.3
2 1980 0.695775 f 0.1
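If the wide column names have a separator between the stub and the suffix, wide_to_long's sep argument handles it. A minimal sketch with hypothetical "A-1970"-style columns (the suffix must still match the default numeric pattern):

```python
import pandas as pd

# Stub "A" and numeric year suffix separated by "-".
wide = pd.DataFrame({"id": [0, 1], "A-1970": [1.0, 2.0], "A-1980": [3.0, 4.0]})
long = pd.wide_to_long(wide, stubnames="A", i="id", j="year", sep="-")
# long is indexed by the MultiIndex (id, year) with a single column "A".
```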
4 Combining with statistics and GroupBy
Combining pivot / stack / unstack with GroupBy and the basic statistical functions can produce some very expressive and fast data manipulations:
In [54]: df
Out[54]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
two -0.076467 -1.187678 1.130127 -1.436737
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux one -0.410001 -0.078638 0.545952 -1.219217
two -1.226825 0.769804 -1.281247 -0.727707
In [55]: df.stack().mean(1).unstack()
Out[55]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
# A different approach, with the same result as above
In [56]: df.groupby(level=1, axis=1).mean()
Out[56]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
In [57]: df.stack().groupby(level=1).mean()
Out[57]:
exp A B
second
one 0.071448 0.455513
two -0.424186 -0.204486
In [58]: df.mean().unstack(0)
Out[58]:
exp A B
animal
cat 0.060843 0.018596
dog -0.413580 0.232430
5 Pivot tables
While pivot() provides general-purpose reshaping for data of any type (strings, numerics, etc.), pandas also provides pivot_table() for pivoting with aggregation of numeric data.
The pivot_table() function can be used to create spreadsheet-style pivot tables, in the style of Excel.
Its main parameters are:
- data: the DataFrame to operate on
- values: a column or list of columns to aggregate
- index: the keys to group by on the pivot table rows
- columns: the keys to group by on the pivot table columns
- aggfunc: the aggregation function to apply (mean by default)
Consider the following data:
In [59]: import datetime
In [60]: df = pd.DataFrame(
....: {
....: "A": ["one", "one", "two", "three"] * 6,
....: "B": ["A", "B", "C"] * 8,
....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
....: "D": np.random.randn(24),
....: "E": np.random.randn(24),
....: "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
....: + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
....: }
....: )
....:
In [61]: df
Out[61]:
A B C D E F
0 one A foo 0.341734 -0.317441 2013-01-01
1 one B foo 0.959726 -1.236269 2013-02-01
2 two C foo -1.110336 0.896171 2013-03-01
3 three A bar -0.619976 -0.487602 2013-04-01
4 one B bar 0.149748 -0.082240 2013-05-01
.. ... .. ... ... ... ...
19 three B foo 0.690579 -2.213588 2013-08-15
20 one C foo 0.995761 1.063327 2013-09-15
21 one A bar 2.396780 1.266143 2013-10-15
22 two B bar 0.014871 0.299368 2013-11-15
23 three C bar 3.357427 -0.863838 2013-12-15
[24 rows x 6 columns]
We can produce a pivot table from this data very easily:
In [62]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[62]:
C bar foo
A B
one A 1.120915 -0.514058
B -0.338421 0.002759
C -0.538846 0.699535
three A -1.181568 NaN
B NaN 0.433512
C 0.588783 NaN
two A NaN 1.000985
B 0.158248 NaN
C NaN 0.176180
In [63]: pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
Out[63]:
A one three two
C bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360
In [64]: pd.pivot_table(
....: df, values=["D", "E"],
....: index=["B"],
....: columns=["A", "C"],
....: aggfunc=np.sum,
....: )
....:
Out[64]:
D E
A one three two one three two
C bar foo bar foo bar foo bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971 2.786113 -0.043211 1.922577 NaN NaN 0.128491
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN 1.368280 -1.103384 NaN -2.128743 -0.194294 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360 -1.976883 1.495717 -0.263660 NaN NaN 0.872482
The resulting DataFrame may have hierarchical indexes on both its rows and columns. If the values argument is omitted, all of the remaining columns that can be aggregated are placed, by default, in an additional level of the column index:
In [65]: pd.pivot_table(df, index=["A", "B"], columns=["C"])
Out[65]:
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 NaN 0.961289 NaN
B NaN 0.433512 NaN -1.064372
C 0.588783 NaN -0.131830 NaN
two A NaN 1.000985 NaN 0.064245
B 0.158248 NaN -0.097147 NaN
C NaN 0.176180 NaN 0.436241
You can also pass Grouper objects for the index and columns arguments:
In [66]: pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
Out[66]:
C bar foo
F
2013-01-31 NaN -0.514058
2013-02-28 NaN 0.002759
2013-03-31 NaN 0.176180
2013-04-30 -1.181568 NaN
2013-05-31 -0.338421 NaN
2013-06-30 -0.538846 NaN
2013-07-31 NaN 1.000985
2013-08-31 NaN 0.433512
2013-09-30 NaN 0.699535
2013-10-31 1.120915 NaN
2013-11-30 0.158248 NaN
2013-12-31 0.588783 NaN
You can render the output as a string with to_string, omitting the missing values:
In [67]: table = pd.pivot_table(df, index=["A", "B"], columns=["C"])
In [68]: print(table.to_string(na_rep=""))
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 0.961289
B 0.433512 -1.064372
C 0.588783 -0.131830
two A 1.000985 0.064245
B 0.158248 -0.097147
C 0.176180 0.436241
pivot_table is also available as an instance method: DataFrame.pivot_table().
5.1 Adding margins
If margins=True is passed to pivot_table, special rows and columns named All are added to the output, summarizing across the categories:
In [69]: df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
Out[69]:
D E
C bar foo All bar foo All
A B
one A 1.804346 1.210272 1.569879 0.179483 0.418374 0.858005
B 0.690376 1.353355 0.898998 1.083825 0.968138 1.101401
C 0.273641 0.418926 0.771139 1.689271 0.446140 1.422136
three A 0.794212 NaN 0.794212 2.049040 NaN 2.049040
B NaN 0.363548 0.363548 NaN 1.625237 1.625237
C 3.915454 NaN 3.915454 1.035215 NaN 1.035215
two A NaN 0.442998 0.442998 NaN 0.447104 0.447104
B 0.202765 NaN 0.202765 0.560757 NaN 0.560757
C NaN 1.819408 1.819408 NaN 0.650439 0.650439
All 1.556686 0.952552 1.246608 1.250924 0.899904 1.059389
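With a deterministic aggfunc the margins are easy to verify by hand. A minimal sketch (hypothetical data) using sum, where each All entry is simply the sum over the corresponding full row or column:

```python
import pandas as pd

# The "All" row/column holds the aggregation over each full axis.
df = pd.DataFrame({"A": ["x", "x", "y"], "B": ["u", "v", "u"], "v": [1, 2, 3]})
table = df.pivot_table(index="A", columns="B", values="v",
                       aggfunc="sum", margins=True)
# Row "x": u=1, v=2, All=3; column "u": 1+3=4; grand total: 6.
```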
6 Cross tabulations
Use crosstab() to compute a cross-tabulation of two (or more) factors. By default, it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.
Its main parameters are:
- index / columns: array-like values to group by in the rows / columns
- values: optional array of values to aggregate according to the factors
- aggfunc: optional aggregation function; without it, a frequency table is computed
- rownames / colnames: names for the row / column factors
- normalize: normalize the table by dividing by the totals
Unless row or column names are specified for the cross-tabulation, any Series passed will be named after its name attribute.
For example:
In [70]: foo, bar, dull, shiny, one, two = "foo", "bar", "dull", "shiny", "one", "two"
In [71]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
In [72]: b = np.array([one, one, two, one, two, one], dtype=object)
In [73]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
In [74]: pd.crosstab(a, [b, c], rownames=["a"], colnames=["b", "c"])
Out[74]:
b one two
c dull shiny dull shiny
a
bar 1 0 0 1
foo 2 1 1 0
If crosstab is passed only two Series, a frequency table is returned:
In [75]: df = pd.DataFrame(
....: {"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4], "C": [1, 1, np.nan, 1, 1]}
....: )
....:
In [76]: df
Out[76]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
In [77]: pd.crosstab(df["A"], df["B"])
Out[77]:
B 3 4
A
1 1 0
2 1 3
Categorical data can also be passed:
In [78]: foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
In [79]: bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])
In [80]: pd.crosstab(foo, bar)
Out[80]:
col_0 d e
row_0
a 1 0
b 0 1
If you want to include all of the categories, even if the actual data does not contain any instances of a particular category, set dropna=False:
In [81]: pd.crosstab(foo, bar, dropna=False)
Out[81]:
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
6.1 Normalization
Frequency tables can also be normalized, to show percentages rather than counts, using the normalize argument:
In [82]: pd.crosstab(df["A"], df["B"], normalize=True)
Out[82]:
B 3 4
A
1 0.2 0.0
2 0.2 0.6
normalize can also normalize the values within each row or each column:
In [83]: pd.crosstab(df["A"], df["B"], normalize="columns")
Out[83]:
B 3 4
A
1 0.5 0.0
2 0.5 1.0
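Passing normalize="index" instead makes each row sum to 1. A sketch on the same counts:

```python
import pandas as pd

# Row-wise normalization of the frequency table.
df = pd.DataFrame({"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4]})
tab = pd.crosstab(df["A"], df["B"], normalize="index")
# Row A=1 has counts (1, 0) -> (1.0, 0.0); row A=2 has (1, 3) -> (0.25, 0.75).
```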
crosstab can also be passed a third Series and an aggregation function (aggfunc): the values of the third Series are grouped according to the groups defined by the first two Series, and the aggregation function is applied within each group:
In [84]: pd.crosstab(df["A"], df["B"], values=df["C"], aggfunc=np.sum)
Out[84]:
B 3 4
A
1 1.0 NaN
2 1.0 2.0
6.2 Adding margins
This function also supports the margins argument:
In [85]: pd.crosstab(
....: df["A"], df["B"], values=df["C"], aggfunc=np.sum, normalize=True, margins=True
....: )
....:
Out[85]:
B 3 4 All
A
1 0.25 0.0 0.25
2 0.25 0.5 0.75
All 0.50 0.5 1.00
7 Tiling
The cut() function computes groupings for the values of the input array. It is often used to transform continuous variables into discrete or categorical variables:
In [86]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
In [87]: pd.cut(ages, bins=3)
Out[87]:
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
If the bins keyword is an integer, equal-width bins are formed. Alternatively, we can specify custom bin edges:
In [88]: c = pd.cut(ages, bins=[0, 18, 35, 70])
In [89]: c
Out[89]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
If the bins keyword is an IntervalIndex, it is used to bin the passed data:
>>> pd.cut([25, 20, 50], bins=c.categories)
[(18, 35], (18, 35], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
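cut's labels argument replaces the interval notation with names of your choosing (the labels below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Name the bins instead of showing interval notation.
ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
groups = pd.cut(ages, bins=[0, 18, 35, 70],
                labels=["child", "adult", "senior"])
```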
8 Computing indicator / dummy variables
A categorical variable can be converted into an "indicator" DataFrame. For example, a DataFrame column (or a Series) containing k distinct values can be turned into a DataFrame with k columns of 0s and 1s using get_dummies:
In [90]: df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
In [91]: pd.get_dummies(df["key"])
Out[91]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
The key column contains 6 values, 3 of them distinct, so get_dummies converts it into a 6×3 DataFrame in which a 1 marks each position where the corresponding value occurs.
You can add a prefix to these column names:
In [92]: dummies = pd.get_dummies(df["key"], prefix="key")
In [93]: dummies
Out[93]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [94]: df[["data1"]].join(dummies)
Out[94]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
This function is often used together with cut:
In [95]: values = np.random.randn(10)
In [96]: values
Out[96]:
array([ 0.4082, -1.0481, -0.0257, -0.9884, 0.0941, 1.2627, 1.29 ,
0.0824, -0.0558, 0.5366])
In [97]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
In [98]: pd.get_dummies(pd.cut(values, bins))
Out[98]:
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 1 0 0 0 0
8 0 0 0 0 0
9 0 0 1 0 0
get_dummies() also accepts a DataFrame.
By default, all categorical (object-dtype) variables are encoded as dummy variables:
In [99]: df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
In [100]: pd.get_dummies(df)
Out[100]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
All non-object columns are included untouched in the output. You can control which columns are encoded with the columns keyword:
In [101]: pd.get_dummies(df, columns=["A"])
Out[101]:
B C A_a A_b
0 c 1 1 0
1 c 2 0 1
2 b 3 1 0
As with a Series, you can use prefix and prefix_sep to set the column-name prefix and prefix separator.
By default, the original column name is used as the prefix and _ as the separator. Three ways of specifying them are supported:
- str: use the same prefix for every encoded column
- list: must be the same length as the number of columns being encoded
- dict: a mapping from column name to prefix
In [102]: simple = pd.get_dummies(df, prefix="new_prefix")
In [103]: simple
Out[103]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [104]: from_list = pd.get_dummies(df, prefix=["from_A", "from_B"])
In [105]: from_list
Out[105]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [106]: from_dict = pd.get_dummies(df, prefix={"B": "from_B", "A": "from_A"})
In [107]: from_dict
Out[107]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Sometimes, when feeding the result into a statistical model, it is enough to keep only k-1 levels of a categorical variable, to avoid collinearity.
You can enable this with drop_first:
In [108]: s = pd.Series(list("abcaa"))
In [109]: pd.get_dummies(s)
Out[109]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
In [110]: pd.get_dummies(s, drop_first=True)
Out[110]:
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
When a column contains only one level, it is omitted from the result:
In [111]: df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})
In [112]: pd.get_dummies(df)
Out[112]:
A_a B_a B_b B_c
0 1 1 0 0
1 1 0 1 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
In [113]: pd.get_dummies(df, drop_first=True)
Out[113]:
B_b B_c
0 0 0
1 1 0
2 0 0
3 1 0
4 0 1
By default the new columns have np.uint8 dtype; a different type can be chosen with the dtype argument:
In [114]: df = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})
In [115]: pd.get_dummies(df, dtype=bool).dtypes
Out[115]:
B float64
A_a bool
A_b bool
A_c bool
dtype: object
9 Factorizing values
Use factorize() to encode one-dimensional values as an enumerated type:
In [116]: x = pd.Series(["A", "A", np.nan, "B", 3.14, np.inf])
In [117]: x
Out[117]:
0 A
1 A
2 NaN
3 B
4 3.14
5 inf
dtype: object
In [118]: labels, uniques = pd.factorize(x)
In [119]: labels
Out[119]: array([ 0, 0, -1, 1, 2, 3])
In [120]: uniques
Out[120]: Index(['A', 'B', 3.14, inf], dtype='object')
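factorize also takes a sort argument: with sort=True the codes refer to the uniques in sorted order rather than in order of appearance. A small sketch:

```python
import pandas as pd

# With sorted uniques, "a" gets code 0 even though "b" appears first.
labels, uniques = pd.factorize(["b", "b", "a", "c", "b"], sort=True)
```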