Installation
pip install pandas
Imports
import numpy as np
import pandas as pd
Object creation
Creating a Series by passing a list of values lets pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out[5]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
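If you pass an index explicitly, it is used instead of the default integer one. A minimal sketch (the string labels here are made up for illustration):

```python
import numpy as np
import pandas as pd

# String labels replace the default RangeIndex
s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))
print(s2.index.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
```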
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6)
dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
Out[9]:
A B C D
2013-01-01 0.232311 -0.531184 0.884277 1.424803
2013-01-02 -0.481030 -0.102810 0.428284 0.355169
2013-01-03 3.400514 -0.121568 -0.673329 -0.466433
2013-01-04 -0.328470 -2.249995 -1.292375 0.356151
2013-01-05 1.558183 0.805518 0.351855 1.953801
2013-01-06 -1.877377 -0.250330 0.988246 -2.228399
Creating a DataFrame by passing a dict of objects that can be converted to Series:
df2 = pd.DataFrame({
'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'
})
df2
Out[6]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes:
df2.dtypes
Out[7]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Viewing data
Here is how to view the top and bottom rows of a DataFrame:
df.head()
Out[6]:
A B C D
2013-01-01 -0.696592 -0.363500 -0.566725 0.792202
2013-01-02 0.961099 0.181294 -2.301134 -1.677969
2013-01-03 -0.911692 -0.808445 -0.127359 -0.308634
2013-01-04 0.818615 -0.676508 -0.260336 1.047999
2013-01-05 1.124423 -0.315442 -1.457229 -0.007171
df.tail(3)
Out[7]:
A B C D
2013-01-04 0.818615 -0.676508 -0.260336 1.047999
2013-01-05 1.124423 -0.315442 -1.457229 -0.007171
2013-01-06 -1.542175 -1.116586 0.547490 0.945169
Display the index and the column names:
df.index
Out[8]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Out[9]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data:
df.to_numpy()
Out[10]:
array([[-0.6965918 , -0.36350027, -0.56672495, 0.79220226],
[ 0.96109885, 0.18129362, -2.30113415, -1.67796915],
[-0.91169172, -0.80844536, -0.12735913, -0.30863412],
[ 0.81861514, -0.67650826, -0.2603358 , 1.04799941],
[ 1.12442329, -0.3154423 , -1.45722895, -0.0071707 ],
[-1.54217514, -1.11658607, 0.54749023, 0.94516896]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:
df2.to_numpy()
Out[12]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
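The dtype-unification behaviour described above can be checked directly. A small sketch with made-up columns: a homogeneous float frame stays float64, while mixing floats and strings forces everything to object:

```python
import numpy as np
import pandas as pd

homog = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "y": ["a", "b"]})

print(homog.to_numpy().dtype)  # float64 -- one dtype covers every column
print(mixed.to_numpy().dtype)  # object  -- each value boxed as a Python object
```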
describe() shows a quick statistic summary of your data:
df.describe()
Out[14]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.041054 -0.516531 -0.694215 0.131933
std 1.143922 0.452037 1.022941 1.041699
min -1.542175 -1.116586 -2.301134 -1.677969
25% -0.857917 -0.775461 -1.234603 -0.233268
50% 0.061012 -0.520004 -0.413530 0.392516
75% 0.925478 -0.327457 -0.160603 0.906927
max 1.124423 0.181294 0.547490 1.047999
Transposing your data:
df.T
Out[16]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -0.696592 0.961099 -0.911692 0.818615 1.124423 -1.542175
B -0.363500 0.181294 -0.808445 -0.676508 -0.315442 -1.116586
C -0.566725 -2.301134 -0.127359 -0.260336 -1.457229 0.547490
D 0.792202 -1.677969 -0.308634 1.047999 -0.007171 0.945169
Sorting by an axis:
df.sort_index(axis=1, ascending=False)
Out[17]:
D C B A
2013-01-01 0.792202 -0.566725 -0.363500 -0.696592
2013-01-02 -1.677969 -2.301134 0.181294 0.961099
2013-01-03 -0.308634 -0.127359 -0.808445 -0.911692
2013-01-04 1.047999 -0.260336 -0.676508 0.818615
2013-01-05 -0.007171 -1.457229 -0.315442 1.124423
2013-01-06 0.945169 0.547490 -1.116586 -1.542175
Sorting by values:
df.sort_values(by='B')
Out[18]:
A B C D
2013-01-06 -1.542175 -1.116586 0.547490 0.945169
2013-01-03 -0.911692 -0.808445 -0.127359 -0.308634
2013-01-04 0.818615 -0.676508 -0.260336 1.047999
2013-01-01 -0.696592 -0.363500 -0.566725 0.792202
2013-01-05 1.124423 -0.315442 -1.457229 -0.007171
2013-01-02 0.961099 0.181294 -2.301134 -1.677969
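sort_values can also sort on several columns at once, with a per-column sort direction. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 1, 1], "C": [9, 8, 7]})
# Sort by 'B' ascending, breaking ties with 'C' descending
out = df.sort_values(by=["B", "C"], ascending=[True, False])
print(out.index.tolist())  # [1, 2, 0]
```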
Selection
Selecting a single column yields a Series; df['A'] is equivalent to df.A:
df.A
Out[6]:
2013-01-01 0.231505
2013-01-02 -0.194741
2013-01-03 0.284820
2013-01-04 -1.834587
2013-01-05 1.334072
2013-01-06 0.211907
Freq: D, Name: A, dtype: float64
df['A']
Out[7]:
2013-01-01 0.231505
2013-01-02 -0.194741
2013-01-03 0.284820
2013-01-04 -1.834587
2013-01-05 1.334072
2013-01-06 0.211907
Freq: D, Name: A, dtype: float64
Selecting rows by slicing with []:
df[0:3]
Out[8]:
A B C D
2013-01-01 0.231505 0.220624 -0.478619 0.004278
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743
2013-01-03 0.284820 -2.168773 3.192018 -0.640769
df['20130102':'20130104']
Out[9]:
A B C D
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743
2013-01-03 0.284820 -2.168773 3.192018 -0.640769
2013-01-04 -1.834587 0.889701 -0.869091 -0.445347
Selection by label
Getting a cross section using a label:
df.loc[dates[0]]
Out[10]:
A 0.231505
B 0.220624
C -0.478619
D 0.004278
Name: 2013-01-01 00:00:00, dtype: float64
Selecting multiple columns by label:
df.loc[:, ['A', 'B']]
Out[13]:
A B
2013-01-01 0.231505 0.220624
2013-01-02 -0.194741 0.814497
2013-01-03 0.284820 -2.168773
2013-01-04 -1.834587 0.889701
2013-01-05 1.334072 0.876368
2013-01-06 0.211907 -0.585302
Slicing by label; both endpoints are included:
df.loc['20130102':'20130104', ['A', 'B']]
Out[14]:
A B
2013-01-02 -0.194741 0.814497
2013-01-03 0.284820 -2.168773
2013-01-04 -1.834587 0.889701
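The endpoint-inclusive behaviour of label slicing differs from Python's usual position-based slicing. A small illustrative sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(s["b":"d"].tolist())   # [20, 30, 40] -- the end label 'd' is included
print(s.iloc[1:3].tolist())  # [20, 30]     -- the stop position 3 is excluded
```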
Reduction in the dimensions of the returned object:
df.loc['20130102', ['A', 'B']]
Out[15]:
A -0.194741
B 0.814497
Name: 2013-01-02 00:00:00, dtype: float64
Getting a scalar value:
df.loc[dates[0],'A']
Out[16]: 0.23150508311794024
Fast access to a scalar (equivalent to the prior method):
df.at[dates[0],'A']
Out[17]: 0.23150508311794024
Selection by position
Selecting via the position of the passed integers:
df.iloc[3]
Out[18]:
A -1.834587
B 0.889701
C -0.869091
D -0.445347
Name: 2013-01-04 00:00:00, dtype: float64
Integer slices, similar to NumPy/Python:
df.iloc[3:5, 0:2]
Out[19]:
A B
2013-01-04 -1.834587 0.889701
2013-01-05 1.334072 0.876368
Lists of integer positions, similar to NumPy/Python:
df.iloc[[1, 2, 4], [0, 2]]
Out[20]:
A C
2013-01-02 -0.194741 -0.624392
2013-01-03 0.284820 3.192018
2013-01-05 1.334072 0.201278
Slicing rows explicitly:
df.iloc[1:3, :]
Out[22]:
A B C D
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743
2013-01-03 0.284820 -2.168773 3.192018 -0.640769
Slicing columns explicitly:
df.iloc[:, 1:3]
Out[23]:
B C
2013-01-01 0.220624 -0.478619
2013-01-02 0.814497 -0.624392
2013-01-03 -2.168773 3.192018
2013-01-04 0.889701 -0.869091
2013-01-05 0.876368 0.201278
2013-01-06 -0.585302 0.929343
Getting a value explicitly:
df.iloc[1, 1]
Out[24]: 0.8144969773188735
Fast access to a scalar (equivalent to the prior method):
df.iat[1, 1]
Out[25]: 0.8144969773188735
Boolean indexing
Using a single column's values to select data:
df[df.A > 0]
Out[26]:
A B C D
2013-01-01 0.231505 0.220624 -0.478619 0.004278
2013-01-03 0.284820 -2.168773 3.192018 -0.640769
2013-01-05 1.334072 0.876368 0.201278 -0.727291
2013-01-06 0.211907 -0.585302 0.929343 -1.169148
Selecting values from a DataFrame where a boolean condition is met:
df[df > 0]
Out[27]:
A B C D
2013-01-01 0.231505 0.220624 NaN 0.004278
2013-01-02 NaN 0.814497 NaN NaN
2013-01-03 0.284820 NaN 3.192018 NaN
2013-01-04 NaN 0.889701 NaN NaN
2013-01-05 1.334072 0.876368 0.201278 NaN
2013-01-06 0.211907 NaN 0.929343 NaN
Using the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
Out[30]:
A B C D E
2013-01-01 0.231505 0.220624 -0.478619 0.004278 one
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743 one
2013-01-03 0.284820 -2.168773 3.192018 -0.640769 two
2013-01-04 -1.834587 0.889701 -0.869091 -0.445347 three
2013-01-05 1.334072 0.876368 0.201278 -0.727291 four
2013-01-06 0.211907 -0.585302 0.929343 -1.169148 three
df2[df2['E'].isin(['two','four'])]
Out[31]:
A B C D E
2013-01-03 0.284820 -2.168773 3.192018 -0.640769 two
2013-01-05 1.334072 0.876368 0.201278 -0.727291 four
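The boolean Series returned by isin() can be inverted with ~ to keep the rows not in the list. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"E": ["one", "two", "four", "three"]})
mask = df["E"].isin(["two", "four"])
# ~mask flips every boolean, selecting the complement
print(df[~mask]["E"].tolist())  # ['one', 'three']
```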
Setting
Setting a new column automatically aligns the data by the index:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1
Out[33]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
df['F'] = s1
df
Out[35]:
A B C D F
2013-01-01 0.231505 0.220624 -0.478619 0.004278 NaN
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743 1.0
2013-01-03 0.284820 -2.168773 3.192018 -0.640769 2.0
2013-01-04 -1.834587 0.889701 -0.869091 -0.445347 3.0
2013-01-05 1.334072 0.876368 0.201278 -0.727291 4.0
2013-01-06 0.211907 -0.585302 0.929343 -1.169148 5.0
Setting values by label:
df.at[dates[0], 'A'] = 0
df
Out[37]:
A B C D F
2013-01-01 0.000000 0.220624 -0.478619 0.004278 NaN
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743 1.0
2013-01-03 0.284820 -2.168773 3.192018 -0.640769 2.0
2013-01-04 -1.834587 0.889701 -0.869091 -0.445347 3.0
2013-01-05 1.334072 0.876368 0.201278 -0.727291 4.0
2013-01-06 0.211907 -0.585302 0.929343 -1.169148 5.0
Setting values by position:
df.iat[0, 1] = 0
df
Out[39]:
A B C D F
2013-01-01 0.000000 0.000000 -0.478619 0.004278 NaN
2013-01-02 -0.194741 0.814497 -0.624392 -0.215743 1.0
2013-01-03 0.284820 -2.168773 3.192018 -0.640769 2.0
2013-01-04 -1.834587 0.889701 -0.869091 -0.445347 3.0
2013-01-05 1.334072 0.876368 0.201278 -0.727291 4.0
2013-01-06 0.211907 -0.585302 0.929343 -1.169148 5.0
Setting by assigning with a NumPy array:
df.loc[:, 'D'] = np.array([5] * len(df))
df
Out[41]:
A B C D F
2013-01-01 0.000000 0.000000 -0.478619 5 NaN
2013-01-02 -0.194741 0.814497 -0.624392 5 1.0
2013-01-03 0.284820 -2.168773 3.192018 5 2.0
2013-01-04 -1.834587 0.889701 -0.869091 5 3.0
2013-01-05 1.334072 0.876368 0.201278 5 4.0
2013-01-06 0.211907 -0.585302 0.929343 5 5.0
A where operation with setting:
df2 = df.copy()
df[df2 > 0] = -df
df
Out[45]:
A B C D F
2013-01-01 0.000000 0.000000 -0.478619 -5 NaN
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0
2013-01-05 -1.334072 -0.876368 -0.201278 -5 -4.0
2013-01-06 -0.211907 -0.585302 -0.929343 -5 -5.0
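The same effect can be had without in-place mutation via DataFrame.where(), which keeps values where the condition holds and substitutes the rest. A small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})
# Keep values that are already <= 0; replace positives with their negation
flipped = df.where(df <= 0, -df)
print(flipped["A"].tolist())  # [-1.0, -2.0]
```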
Missing data
pandas primarily uses np.nan to represent missing data; by default it is not included in computations.
Reindexing allows you to change/add/delete the index on a specified axis. It returns a copy of the data, leaving the original unchanged:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
Out[49]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.478619 -5 NaN 1.0
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0 1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0 NaN
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0 NaN
To drop any rows that have missing data:
df1.dropna(how='any')
Out[50]:
A B C D F E
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0 1.0
Filling missing data:
df.fillna(value=5)
Out[52]:
A B C D F
2013-01-01 0.000000 0.000000 -0.478619 -5 5.0
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0
2013-01-05 -1.334072 -0.876368 -0.201278 -5 -4.0
2013-01-06 -0.211907 -0.585302 -0.929343 -5 -5.0
To get the boolean mask where values are NaN:
pd.isna(df1)
Out[53]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
Statistics
Operations in general exclude missing data.
Performing a descriptive statistic:
df.mean()
Out[54]:
A -0.643355
B -0.889107
C -1.049123
D -5.000000
F -3.000000
dtype: float64
The same operation on the other axis (i.e., row-wise):
df.mean(axis=1)
Out[55]:
2013-01-01 -1.369655
2013-01-02 -1.526726
2013-01-03 -2.529122
2013-01-04 -2.318676
2013-01-05 -2.282344
2013-01-06 -2.345311
Freq: D, dtype: float64
Operating with objects of different dimensionality requires alignment first. In addition, pandas automatically broadcasts along the specified dimension:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
Out[58]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
df.sub(s, axis='index')
Out[59]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.284820 -3.168773 -4.192018 -6.0 -3.0
2013-01-04 -4.834587 -3.889701 -3.869091 -8.0 -6.0
2013-01-05 -6.334072 -5.876368 -5.201278 -10.0 -9.0
2013-01-06 NaN NaN NaN NaN NaN
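Broadcasting works along either axis; subtracting a row-shaped Series with axis='columns' is the counterpart of the axis='index' call above. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]})
row = pd.Series({"A": 1.0, "B": 10.0})
# The Series is aligned on the column labels and subtracted from every row
out = df.sub(row, axis="columns")
print(out["B"].tolist())  # [0.0, 10.0]
```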
Apply
Applying functions to the data:
df.apply(np.cumsum)
Out[60]:
A B C D F
2013-01-01 0.000000 0.000000 -0.478619 -5 NaN
2013-01-02 -0.194741 -0.814497 -1.103010 -10 -1.0
2013-01-03 -0.479562 -2.983270 -4.295029 -15 -3.0
2013-01-04 -2.314148 -3.872971 -5.164120 -20 -6.0
2013-01-05 -3.648220 -4.749339 -5.365398 -25 -10.0
2013-01-06 -3.860128 -5.334641 -6.294741 -30 -15.0
df.apply(lambda x: x.max() - x.min())
Out[61]:
A 1.834587
B 2.168773
C 2.990740
D 0.000000
F 4.000000
dtype: float64
Histogramming
s = pd.Series(np.random.randint(0, 7, size=10))
s
Out[63]:
0 4
1 2
2 2
3 5
4 4
5 5
6 2
7 2
8 5
9 2
dtype: int32
s.value_counts()
Out[64]:
2 5
5 3
4 2
dtype: int64
String methods
Series is equipped with a set of string processing methods in the str attribute, as shown below. Note that pattern matching in str methods uses regular expressions by default:
s = pd.Series(['A', 'B', 'C', 'Aaba', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
Out[68]:
0 a
1 b
2 c
3 aaba
4 NaN
5 caba
6 dog
7 cat
dtype: object
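Because str methods default to regular-expression matching, patterns like anchors work out of the box. A small sketch:

```python
import pandas as pd

s = pd.Series(["Aaba", "cat", "CABA"])
# '^c' is a regex anchor: match only strings starting with 'c' (case-insensitive)
print(s.str.contains("^c", case=False).tolist())  # [False, True, True]
```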
Merge
Concat
pandas provides various facilities for combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.
Concatenating pandas objects together with concat():
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[70]:
0 1 2 3
0 -1.994210 -1.916873 0.533659 0.049655
1 -1.775216 -0.077889 -0.371278 0.266553
2 -0.009074 -2.131895 0.595984 0.025998
3 0.267357 1.966662 0.747365 0.319554
4 -0.107444 0.857005 0.089416 0.437403
5 2.031898 -0.059433 -1.418391 -0.596198
6 -0.654181 0.444417 1.479503 1.012438
7 -0.208141 1.865477 -0.385802 -0.980770
8 -0.346811 0.890188 1.416453 0.476470
9 0.069049 0.249389 2.107487 -0.666272
pieces = [df[0:3], df[3:7], df[7:]]
pd.concat(pieces)
Out[72]:
0 1 2 3
0 -1.994210 -1.916873 0.533659 0.049655
1 -1.775216 -0.077889 -0.371278 0.266553
2 -0.009074 -2.131895 0.595984 0.025998
3 0.267357 1.966662 0.747365 0.319554
4 -0.107444 0.857005 0.089416 0.437403
5 2.031898 -0.059433 -1.418391 -0.596198
6 -0.654181 0.444417 1.479503 1.012438
7 -0.208141 1.865477 -0.385802 -0.980770
8 -0.346811 0.890188 1.416453 0.476470
9 0.069049 0.249389 2.107487 -0.666272
Join
SQL-style merges:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
Out[75]:
key lval
0 foo 1
1 foo 2
right
Out[76]:
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on='key')
Out[77]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
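The four output rows above come from duplicate keys producing a cartesian product. With unique keys, merge yields one row per key; a small sketch:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
merged = pd.merge(left, right, on="key")
print(len(merged))  # 2 -- no row multiplication when keys are unique
```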
Append
Append rows to a DataFrame (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
Out[79]:
A B C D
0 1.922779 -0.734365 -2.369809 1.989087
1 -0.656828 0.083079 0.121598 0.609904
2 0.545435 -0.459353 -0.478351 -2.299534
3 -0.411795 0.322631 -0.921530 0.510665
4 -1.582288 -0.825331 0.956622 0.228850
5 1.067413 -0.636496 -2.088929 -1.527931
6 0.211201 0.553021 -0.826861 0.420775
7 -1.010485 -0.810219 -0.444305 -1.868086
s = df.iloc[3]
s
Out[81]:
A -0.411795
B 0.322631
C -0.921530
D 0.510665
Name: 3, dtype: float64
df.append(s, ignore_index=True)
Out[82]:
A B C D
0 1.922779 -0.734365 -2.369809 1.989087
1 -0.656828 0.083079 0.121598 0.609904
2 0.545435 -0.459353 -0.478351 -2.299534
3 -0.411795 0.322631 -0.921530 0.510665
4 -1.582288 -0.825331 0.956622 0.228850
5 1.067413 -0.636496 -2.088929 -1.527931
6 0.211201 0.553021 -0.826861 0.420775
7 -1.010485 -0.810219 -0.444305 -1.868086
8 -0.411795 0.322631 -0.921530 0.510665
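Since DataFrame.append was removed in pandas 2.0, the replacement is pd.concat. A sketch of the equivalent modern spelling:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
s = df.iloc[1]
# Turn the row Series back into a one-row frame and concatenate it
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(len(out))  # 3
```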
Grouping
By "group by" we mean a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria;
- Applying a function to each group independently;
- Combining the results into a data structure.
df = pd.DataFrame({
'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)
})
df
Out[84]:
A B C D
0 foo one -0.046626 -0.817392
1 bar one 0.071607 -0.774991
2 foo two 1.521541 -1.542514
3 bar three -1.602113 0.631629
4 foo two 1.658801 0.121176
5 bar two 0.068362 -0.166527
6 foo one 0.282646 -0.016375
7 foo three -0.420728 -1.598560
Grouping and then applying the sum() function to the resulting groups:
df.groupby('A').sum()
Out[85]:
C D
A
bar -1.462143 -0.309889
foo 2.995634 -3.853664
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
df.groupby(['A', 'B']).sum()
Out[87]:
C D
A B
bar one 0.071607 -0.774991
three -1.602113 0.631629
two 0.068362 -0.166527
foo one 0.236019 -0.833767
three -0.420728 -1.598560
two 3.180342 -1.421338
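groupby also supports applying several aggregations at once via agg(). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})
# One pass computes both reductions per group
out = df.groupby("A")["C"].agg(["sum", "mean"])
print(out.loc["foo", "sum"])  # 4.0
```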
Reshaping
Stack
tuples = list(zip(*[
['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two', 'one', 'two']
]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
Out[92]:
A B
first second
bar one 1.259062 0.869558
two -0.743441 1.722838
baz one 0.015719 1.503939
two 0.570381 0.632574
The stack() method "compresses" a level in the DataFrame's columns:
stacked = df2.stack()
stacked
Out[94]:
first second
bar one A 1.259062
B 0.869558
two A -0.743441
B 1.722838
baz one A 0.015719
B 1.503939
two A 0.570381
B 0.632574
dtype: float64
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
Out[95]:
A B
first second
bar one 1.259062 0.869558
two -0.743441 1.722838
baz one 0.015719 1.503939
two 0.570381 0.632574
stacked.unstack(1)
Out[96]:
second one two
first
bar A 1.259062 -0.743441
B 0.869558 1.722838
baz A 0.015719 0.570381
B 1.503939 0.632574
stacked.unstack(0)
Out[97]:
first bar baz
second
one A 1.259062 0.015719
B 0.869558 1.503939
two A -0.743441 0.570381
B 1.722838 0.632574
Pivot tables
df = pd.DataFrame({
'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)
})
df
Out[99]:
A B C D E
0 one A foo 0.056418 -0.830885
1 one B foo -0.364376 0.770207
2 two C foo -0.461330 0.294341
3 three A bar -1.237736 0.336037
4 one B bar -0.232306 0.165646
5 one C bar 0.547340 -1.031499
6 two A foo 0.006193 0.666089
7 three B foo 1.214052 -0.898943
8 one C foo 0.215837 -0.101430
9 one A bar 0.917909 0.721762
10 two B bar -0.359241 1.600898
11 three C bar 1.272114 -0.275911
We can produce pivot tables from this data very easily:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[100]:
C bar foo
A B
one A 0.917909 0.056418
B -0.232306 -0.364376
C 0.547340 0.215837
three A -1.237736 NaN
B NaN 1.214052
C 1.272114 NaN
two A NaN 0.006193
B -0.359241 NaN
C NaN -0.461330
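pivot_table also accepts an aggregation function and a fill value for the empty cells. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two"],
    "C": ["bar", "foo", "bar"],
    "D": [1.0, 2.0, 3.0],
})
# aggfunc picks the reduction; fill_value replaces the NaN holes
table = pd.pivot_table(df, values="D", index="A", columns="C",
                       aggfunc="sum", fill_value=0)
print(table.loc["two", "foo"])
```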
Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
Out[103]:
2012-01-01 24434
Freq: 5T, dtype: int32
Time zone representation:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
Out[106]:
2012-03-06 -0.732182
2012-03-07 -0.906948
2012-03-08 -1.093885
2012-03-09 -1.564909
2012-03-10 -0.630042
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
Out[108]:
2012-03-06 00:00:00+00:00 -0.732182
2012-03-07 00:00:00+00:00 -0.906948
2012-03-08 00:00:00+00:00 -1.093885
2012-03-09 00:00:00+00:00 -1.564909
2012-03-10 00:00:00+00:00 -0.630042
Freq: D, dtype: float64
Converting to another time zone:
ts_utc.tz_convert('US/Eastern')
Out[109]:
2012-03-05 19:00:00-05:00 -0.732182
2012-03-06 19:00:00-05:00 -0.906948
2012-03-07 19:00:00-05:00 -1.093885
2012-03-08 19:00:00-05:00 -1.564909
2012-03-09 19:00:00-05:00 -0.630042
Freq: D, dtype: float64
Converting between time span representations:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
Out[112]:
2012-01-31 -1.345134
2012-02-29 0.214073
2012-03-31 -0.764937
2012-04-30 -1.346597
2012-05-31 -0.188443
Freq: M, dtype: float64
ps = ts.to_period()
ps
Out[114]:
2012-01 -1.345134
2012-02 0.214073
2012-03 -0.764937
2012-04 -1.346597
2012-05 -0.188443
Freq: M, dtype: float64
ps.to_timestamp()
Out[115]:
2012-01-01 -1.345134
2012-02-01 0.214073
2012-03-01 -0.764937
2012-04-01 -1.346597
2012-05-01 -0.188443
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
Out[119]:
1990-03-01 09:00 0.071360
1990-06-01 09:00 -0.164712
1990-09-01 09:00 1.755744
1990-12-01 09:00 0.787537
1991-03-01 09:00 1.148147
Freq: H, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
Convert the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[122]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names. Assigning directly to Series.cat.categories was deprecated and later removed, so use rename_categories() instead:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
Reorder the categories and simultaneously add the missing categories; methods under Series.cat return a new Series by default:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
Out[125]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per order in the categories, not lexical order:
df.sort_values(by="grade")
Out[126]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
Grouping by a categorical column also shows empty categories:
df.groupby("grade").size()
Out[127]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
Plotting
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Out[130]: <matplotlib.axes._subplots.AxesSubplot at 0x21756380250>
The plot() method on a DataFrame plots all of the columns with labels:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b53a2d7f0>
plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f2b539728d0>
Getting data in/out
CSV
Writing to a CSV file:
df.to_csv('foo.csv')
Reading from a CSV file:
pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
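The Unnamed: 0 column above is the saved index coming back as ordinary data; passing index_col=0 (plus parse_dates for a datetime index) restores it. A roundtrip sketch, with roundtrip.csv as an assumed throwaway filename:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0]}, index=pd.date_range("2000-01-01", periods=2))
df.to_csv("roundtrip.csv")
# index_col=0 reads the first column as the index; parse_dates re-parses it
back = pd.read_csv("roundtrip.csv", index_col=0, parse_dates=True)
print(back.index[0])  # 2000-01-01 00:00:00
```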
HDF5
Writing to an HDF5 store:
df.to_hdf('foo.h5', key='df')
Reading from an HDF5 store:
pd.read_hdf('foo.h5', 'df')
Out[146]:
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
Excel
Writing to an Excel file:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an Excel file:
pd.read_excel('foo.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
Out[148]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]