Python-Pandas快速入门学习笔记

安装
pip install Pandas
导入
import numpy as np
import pandas as pd
生成对象

用值列表生成Series时,Pandas 默认自动生成整数索引

s = pd.Series([1,3,5,np.nan,6,8])

s
Out[5]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

用含日期时间索引与标签的 NumPy 数组生成 DataFrame

dates = pd.date_range('20130101',periods=6)

dates
Out[7]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

df
Out[9]: 
                   A         B         C         D
2013-01-01  0.232311 -0.531184  0.884277  1.424803
2013-01-02 -0.481030 -0.102810  0.428284  0.355169
2013-01-03  3.400514 -0.121568 -0.673329 -0.466433
2013-01-04 -0.328470 -2.249995 -1.292375  0.356151
2013-01-05  1.558183  0.805518  0.351855  1.953801
2013-01-06 -1.877377 -0.250330  0.988246 -2.228399

用 Series 字典对象生成 DataFrame

import pandas as pd

import numpy as np

df2 = pd.DataFrame({
'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'
})

df2
Out[6]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

DataFrame 的列有不同数据类型

df2.dtypes
Out[7]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
查看数据

下列代码说明如何查看 DataFrame 头部和尾部数据:

df.head()
Out[6]: 
                   A         B         C         D
2013-01-01 -0.696592 -0.363500 -0.566725  0.792202
2013-01-02  0.961099  0.181294 -2.301134 -1.677969
2013-01-03 -0.911692 -0.808445 -0.127359 -0.308634
2013-01-04  0.818615 -0.676508 -0.260336  1.047999
2013-01-05  1.124423 -0.315442 -1.457229 -0.007171

df.tail(3)
Out[7]: 
                   A         B         C         D
2013-01-04  0.818615 -0.676508 -0.260336  1.047999
2013-01-05  1.124423 -0.315442 -1.457229 -0.007171
2013-01-06 -1.542175 -1.116586  0.547490  0.945169

显示索引与列名:

df.index
Out[8]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns
Out[9]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy()输出底层数据的 NumPy 对象。DataFrame的列由多种数据类型组成时,该操作耗费系统资源较大,这也是 Pandas 和 NumPy 的本质区别:NumPy 数组只有一种数据类型,DataFrame 每列的数据类型各不相同。调用 DataFrame.to_numpy()时,Pandas 查找支持 DataFrame 里所有数据类型的 NumPy 数据类型。还有一种数据类型是 object,可以把 DataFrame 列里的值强制转换为 Python 对象。

下面的 df 这个 DataFrame里的值都是浮点数,DataFrame.to_numpy() 的操作会很快,而且不复制数据。

df.to_numpy()
Out[10]: 
array([[-0.6965918 , -0.36350027, -0.56672495,  0.79220226],
       [ 0.96109885,  0.18129362, -2.30113415, -1.67796915],
       [-0.91169172, -0.80844536, -0.12735913, -0.30863412],
       [ 0.81861514, -0.67650826, -0.2603358 ,  1.04799941],
       [ 1.12442329, -0.3154423 , -1.45722895, -0.0071707 ],
       [-1.54217514, -1.11658607,  0.54749023,  0.94516896]])

df2 这个DataFrame包含了多种类型,DataFrame.to_numpy()操作就会耗费较多资源。

df2.to_numpy()
Out[12]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

tips

DataFrame.to_numpy()的输出不包含行索引和列标签。

describe()可以快速查看数据的统计摘要:

df.describe()
Out[14]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.041054 -0.516531 -0.694215  0.131933
std    1.143922  0.452037  1.022941  1.041699
min   -1.542175 -1.116586 -2.301134 -1.677969
25%   -0.857917 -0.775461 -1.234603 -0.233268
50%    0.061012 -0.520004 -0.413530  0.392516
75%    0.925478 -0.327457 -0.160603  0.906927
max    1.124423  0.181294  0.547490  1.047999

转置数据:

df.T
Out[16]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.696592    0.961099   -0.911692    0.818615    1.124423   -1.542175
B   -0.363500    0.181294   -0.808445   -0.676508   -0.315442   -1.116586
C   -0.566725   -2.301134   -0.127359   -0.260336   -1.457229    0.547490
D    0.792202   -1.677969   -0.308634    1.047999   -0.007171    0.945169

按轴排序:

df.sort_index(axis=1,ascending=False)
Out[17]: 
                   D         C         B         A
2013-01-01  0.792202 -0.566725 -0.363500 -0.696592
2013-01-02 -1.677969 -2.301134  0.181294  0.961099
2013-01-03 -0.308634 -0.127359 -0.808445 -0.911692
2013-01-04  1.047999 -0.260336 -0.676508  0.818615
2013-01-05 -0.007171 -1.457229 -0.315442  1.124423
2013-01-06  0.945169  0.547490 -1.116586 -1.542175

按值排序:

df.sort_values(by='B')
Out[18]: 
                   A         B         C         D
2013-01-06 -1.542175 -1.116586  0.547490  0.945169
2013-01-03 -0.911692 -0.808445 -0.127359 -0.308634
2013-01-04  0.818615 -0.676508 -0.260336  1.047999
2013-01-01 -0.696592 -0.363500 -0.566725  0.792202
2013-01-05  1.124423 -0.315442 -1.457229 -0.007171
2013-01-02  0.961099  0.181294 -2.301134 -1.677969
获取数据

选择单列,产生 Series,与 df.A 等效:

df.A
Out[6]: 
2013-01-01    0.231505
2013-01-02   -0.194741
2013-01-03    0.284820
2013-01-04   -1.834587
2013-01-05    1.334072
2013-01-06    0.211907
Freq: D, Name: A, dtype: float64

df['A']
Out[7]: 
2013-01-01    0.231505
2013-01-02   -0.194741
2013-01-03    0.284820
2013-01-04   -1.834587
2013-01-05    1.334072
2013-01-06    0.211907
Freq: D, Name: A, dtype: float64

用 [ ] 切片行:

df[0:3]
Out[8]: 
                   A         B         C         D
2013-01-01  0.231505  0.220624 -0.478619  0.004278
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743
2013-01-03  0.284820 -2.168773  3.192018 -0.640769

df['20130102':'20130104']
Out[9]: 
                   A         B         C         D
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743
2013-01-03  0.284820 -2.168773  3.192018 -0.640769
2013-01-04 -1.834587  0.889701 -0.869091 -0.445347
按标签选择

用标签提取一行数据:

df.loc[dates[0]]
Out[10]: 
A    0.231505
B    0.220624
C   -0.478619
D    0.004278
Name: 2013-01-01 00:00:00, dtype: float64

用标签选择多列数据:

df.loc[:,['A','B']]
Out[13]: 
                   A         B
2013-01-01  0.231505  0.220624
2013-01-02 -0.194741  0.814497
2013-01-03  0.284820 -2.168773
2013-01-04 -1.834587  0.889701
2013-01-05  1.334072  0.876368
2013-01-06  0.211907 -0.585302

用标签切片,包含行与列结束点:

df.loc['20130102':'20130104',['A','B']]
Out[14]: 
                   A         B
2013-01-02 -0.194741  0.814497
2013-01-03  0.284820 -2.168773
2013-01-04 -1.834587  0.889701

返回对象降维:

df.loc['20130102',['A','B']]
Out[15]: 
A   -0.194741
B    0.814497
Name: 2013-01-02 00:00:00, dtype: float64

提取标量值:

df.loc[dates[0],'A']
Out[16]: 0.23150508311794024

快速访问标量,与上述方法等效:

df.at[dates[0],'A']
Out[17]: 0.23150508311794024
按位置选择

用整数位置选择:

df.iloc[3]
Out[18]: 
A   -1.834587
B    0.889701
C   -0.869091
D   -0.445347
Name: 2013-01-04 00:00:00, dtype: float64

类似 NumPy / Python,用整数切片:

df.iloc[3:5,0:2]
Out[19]: 
                   A         B
2013-01-04 -1.834587  0.889701
2013-01-05  1.334072  0.876368

类似 NumPy / Python,用整数列表按位置切片:

df.iloc[[1,2,4],[0,2]]
Out[20]: 
                   A         C
2013-01-02 -0.194741 -0.624392
2013-01-03  0.284820  3.192018
2013-01-05  1.334072  0.201278

显式整行切片:

df.iloc[1:3,:]
Out[22]: 
                   A         B         C         D
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743
2013-01-03  0.284820 -2.168773  3.192018 -0.640769

显式整列切片:

df.iloc[:,1:3]
Out[23]: 
                   B         C
2013-01-01  0.220624 -0.478619
2013-01-02  0.814497 -0.624392
2013-01-03 -2.168773  3.192018
2013-01-04  0.889701 -0.869091
2013-01-05  0.876368  0.201278
2013-01-06 -0.585302  0.929343

显式提取值:

df.iloc[1,1]
Out[24]: 0.8144969773188735

快速访问标量,与上述方法等效:

df.iat[1,1]
Out[25]: 0.8144969773188735
布尔索引

用单列的值选择数据:

df[df.A>0]
Out[26]: 
                   A         B         C         D
2013-01-01  0.231505  0.220624 -0.478619  0.004278
2013-01-03  0.284820 -2.168773  3.192018 -0.640769
2013-01-05  1.334072  0.876368  0.201278 -0.727291
2013-01-06  0.211907 -0.585302  0.929343 -1.169148

选择 DataFrame 里满足条件的值:

df[df>0]
Out[27]: 
                   A         B         C         D
2013-01-01  0.231505  0.220624       NaN  0.004278
2013-01-02       NaN  0.814497       NaN       NaN
2013-01-03  0.284820       NaN  3.192018       NaN
2013-01-04       NaN  0.889701       NaN       NaN
2013-01-05  1.334072  0.876368  0.201278       NaN
2013-01-06  0.211907       NaN  0.929343       NaN

isin()筛选:

df2 = df.copy()

df2['E'] = [ 'one','one','two','three','four','three']

df2
Out[30]: 
                   A         B         C         D      E
2013-01-01  0.231505  0.220624 -0.478619  0.004278    one
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743    one
2013-01-03  0.284820 -2.168773  3.192018 -0.640769    two
2013-01-04 -1.834587  0.889701 -0.869091 -0.445347  three
2013-01-05  1.334072  0.876368  0.201278 -0.727291   four
2013-01-06  0.211907 -0.585302  0.929343 -1.169148  three

df2[df2['E'].isin(['two','four'])]
Out[31]: 
                   A         B         C         D     E
2013-01-03  0.284820 -2.168773  3.192018 -0.640769   two
2013-01-05  1.334072  0.876368  0.201278 -0.727291  four
赋值

用索引自动对齐新增列的数据:

s1=pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))

s1
Out[33]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df['F'] = s1

df
Out[35]: 
                   A         B         C         D    F
2013-01-01  0.231505  0.220624 -0.478619  0.004278  NaN
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743  1.0
2013-01-03  0.284820 -2.168773  3.192018 -0.640769  2.0
2013-01-04 -1.834587  0.889701 -0.869091 -0.445347  3.0
2013-01-05  1.334072  0.876368  0.201278 -0.727291  4.0
2013-01-06  0.211907 -0.585302  0.929343 -1.169148  5.0

按标签赋值:

df.at[dates[0], 'A']=0

df
Out[37]: 
                   A         B         C         D    F
2013-01-01  0.000000  0.220624 -0.478619  0.004278  NaN
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743  1.0
2013-01-03  0.284820 -2.168773  3.192018 -0.640769  2.0
2013-01-04 -1.834587  0.889701 -0.869091 -0.445347  3.0
2013-01-05  1.334072  0.876368  0.201278 -0.727291  4.0
2013-01-06  0.211907 -0.585302  0.929343 -1.169148  5.0

按位置赋值:

df.iat[0,1]=0

df
Out[39]: 
                   A         B         C         D    F
2013-01-01  0.000000  0.000000 -0.478619  0.004278  NaN
2013-01-02 -0.194741  0.814497 -0.624392 -0.215743  1.0
2013-01-03  0.284820 -2.168773  3.192018 -0.640769  2.0
2013-01-04 -1.834587  0.889701 -0.869091 -0.445347  3.0
2013-01-05  1.334072  0.876368  0.201278 -0.727291  4.0
2013-01-06  0.211907 -0.585302  0.929343 -1.169148  5.0

按 NumPy 数组赋值:

df.loc[:,'D']=np.array([5]*len(df))

df
Out[41]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.478619  5  NaN
2013-01-02 -0.194741  0.814497 -0.624392  5  1.0
2013-01-03  0.284820 -2.168773  3.192018  5  2.0
2013-01-04 -1.834587  0.889701 -0.869091  5  3.0
2013-01-05  1.334072  0.876368  0.201278  5  4.0
2013-01-06  0.211907 -0.585302  0.929343  5  5.0

where 条件赋值:

df2=df.copy()

df[df2>0]=-df

df
Out[45]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.478619 -5  NaN
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0
2013-01-05 -1.334072 -0.876368 -0.201278 -5 -4.0
2013-01-06 -0.211907 -0.585302 -0.929343 -5 -5.0
缺失值

Pandas 主要用 np.nan 表示缺失数据。 计算时,默认不包含空值。

重建索引(reindex)可以更改、添加、删除指定轴的索引,并返回数据副本,即不更改原数据。

df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])

df1.loc[dates[0]:dates[1],'E'] = 1

df1
Out[49]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -0.478619 -5  NaN  1.0
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0  1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0  NaN
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0  NaN

删除所有含缺失值的行:

df1.dropna(how='any')
Out[50]: 
                   A         B         C  D    F    E
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0  1.0

填充缺失值:

df.fillna(value=5)
Out[52]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.478619 -5  5.0
2013-01-02 -0.194741 -0.814497 -0.624392 -5 -1.0
2013-01-03 -0.284820 -2.168773 -3.192018 -5 -2.0
2013-01-04 -1.834587 -0.889701 -0.869091 -5 -3.0
2013-01-05 -1.334072 -0.876368 -0.201278 -5 -4.0
2013-01-06 -0.211907 -0.585302 -0.929343 -5 -5.0

提取 nan 值的布尔掩码:

pd.isna(df1)
Out[53]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
统计

一般情况下,运算时排除缺失值。

描述性统计:

df.mean()
Out[54]: 
A   -0.643355
B   -0.889107
C   -1.049123
D   -5.000000
F   -3.000000
dtype: float64

在另一个轴(即,行)上执行同样的操作:

df.mean(1)
Out[55]: 
2013-01-01   -1.369655
2013-01-02   -1.526726
2013-01-03   -2.529122
2013-01-04   -2.318676
2013-01-05   -2.282344
2013-01-06   -2.345311
Freq: D, dtype: float64

不同维度对象运算时,要先对齐。 此外,Pandas 自动沿指定维度广播。

s=pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)

s
Out[58]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

df.sub(s,axis='index')
Out[59]: 
                   A         B         C     D    F
2013-01-01       NaN       NaN       NaN   NaN  NaN
2013-01-02       NaN       NaN       NaN   NaN  NaN
2013-01-03 -1.284820 -3.168773 -4.192018  -6.0 -3.0
2013-01-04 -4.834587 -3.889701 -3.869091  -8.0 -6.0
2013-01-05 -6.334072 -5.876368 -5.201278 -10.0 -9.0
2013-01-06       NaN       NaN       NaN   NaN  NaN
Apply 函数

Apply 函数处理数据:

df.apply(np.cumsum)
Out[60]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -0.478619  -5   NaN
2013-01-02 -0.194741 -0.814497 -1.103010 -10  -1.0
2013-01-03 -0.479562 -2.983270 -4.295029 -15  -3.0
2013-01-04 -2.314148 -3.872971 -5.164120 -20  -6.0
2013-01-05 -3.648220 -4.749339 -5.365398 -25 -10.0
2013-01-06 -3.860128 -5.334641 -6.294741 -30 -15.0

df.apply(lambda x:x.max()-x.min())
Out[61]: 
A    1.834587
B    2.168773
C    2.990740
D    0.000000
F    4.000000
dtype: float64
直方图
s = pd.Series(np.random.randint(0,7,size=10))

s
Out[63]: 
0    4
1    2
2    2
3    5
4    4
5    5
6    2
7    2
8    5
9    2
dtype: int32

s.value_counts()
Out[64]: 
2    5
5    3
4    2
dtype: int64
字符串方法

Series 的 str 属性包含一组字符串处理功能,如下列代码所示。注意,str 的模式匹配默认使用正则表达式。

s = pd.Series(['A','B','C','Aaba',np.nan,'CABA','dog','cat'])

s.str.lower()
Out[68]: 
0       a
1       b
2       c
3    aaba
4     NaN
5    caba
6     dog
7     cat
dtype: object
合并(Merge)
结合(Concat)

Pandas 提供了多种将 Series、DataFrame 对象组合在一起的功能,用索引与关联代数功能的多种设置逻辑可执行连接(join)与合并(merge)操作。

concat() 用于连接 Pandas 对象:

df = pd.DataFrame(np.random.randn(10,4))

df
Out[70]: 
          0         1         2         3
0 -1.994210 -1.916873  0.533659  0.049655
1 -1.775216 -0.077889 -0.371278  0.266553
2 -0.009074 -2.131895  0.595984  0.025998
3  0.267357  1.966662  0.747365  0.319554
4 -0.107444  0.857005  0.089416  0.437403
5  2.031898 -0.059433 -1.418391 -0.596198
6 -0.654181  0.444417  1.479503  1.012438
7 -0.208141  1.865477 -0.385802 -0.980770
8 -0.346811  0.890188  1.416453  0.476470
9  0.069049  0.249389  2.107487 -0.666272

pieces = [df[0:3],df[3:7],df[7:]]

pd.concat(pieces)
Out[72]: 
          0         1         2         3
0 -1.994210 -1.916873  0.533659  0.049655
1 -1.775216 -0.077889 -0.371278  0.266553
2 -0.009074 -2.131895  0.595984  0.025998
3  0.267357  1.966662  0.747365  0.319554
4 -0.107444  0.857005  0.089416  0.437403
5  2.031898 -0.059433 -1.418391 -0.596198
6 -0.654181  0.444417  1.479503  1.012438
7 -0.208141  1.865477 -0.385802 -0.980770
8 -0.346811  0.890188  1.416453  0.476470
9  0.069049  0.249389  2.107487 -0.666272
连接(join)

SQL 风格的合并。

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})

right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})

left
Out[75]: 
   key  lval
0  foo     1
1  foo     2

right
Out[76]: 
   key  rval
0  foo     4
1  foo     5

pd.merge(left,right,on='key')
Out[77]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
追加(Append)

为 DataFrame 追加行。

df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])

df
Out[79]: 
          A         B         C         D
0  1.922779 -0.734365 -2.369809  1.989087
1 -0.656828  0.083079  0.121598  0.609904
2  0.545435 -0.459353 -0.478351 -2.299534
3 -0.411795  0.322631 -0.921530  0.510665
4 -1.582288 -0.825331  0.956622  0.228850
5  1.067413 -0.636496 -2.088929 -1.527931
6  0.211201  0.553021 -0.826861  0.420775
7 -1.010485 -0.810219 -0.444305 -1.868086

s=df.iloc[3]

s
Out[81]: 
A   -0.411795
B    0.322631
C   -0.921530
D    0.510665
Name: 3, dtype: float64

df.append(s,ignore_index=True)
Out[82]: 
          A         B         C         D
0  1.922779 -0.734365 -2.369809  1.989087
1 -0.656828  0.083079  0.121598  0.609904
2  0.545435 -0.459353 -0.478351 -2.299534
3 -0.411795  0.322631 -0.921530  0.510665
4 -1.582288 -0.825331  0.956622  0.228850
5  1.067413 -0.636496 -2.088929 -1.527931
6  0.211201  0.553021 -0.826861  0.420775
7 -1.010485 -0.810219 -0.444305 -1.868086
8 -0.411795  0.322631 -0.921530  0.510665
分组(Grouping)

“group by” 指的是涵盖下列一项或多项步骤的处理流程:

  • 分割:按条件把数据分割成多组;
  • 应用:为每组单独应用函数;
  • 组合:将处理结果组合成一个数据结构。
df = pd.DataFrame({
'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)
})

df
Out[84]: 
     A      B         C         D
0  foo    one -0.046626 -0.817392
1  bar    one  0.071607 -0.774991
2  foo    two  1.521541 -1.542514
3  bar  three -1.602113  0.631629
4  foo    two  1.658801  0.121176
5  bar    two  0.068362 -0.166527
6  foo    one  0.282646 -0.016375
7  foo  three -0.420728 -1.598560

先分组,再用 sum()函数计算每组的汇总数据:

df.groupby('A').sum()
Out[85]: 
            C         D
A                      
bar -1.462143 -0.309889
foo  2.995634 -3.853664

多列分组后,生成多层索引,也可以应用 sum 函数:

df.groupby(['A','B']).sum()
Out[87]: 
                  C         D
A   B                        
bar one    0.071607 -0.774991
    three -1.602113  0.631629
    two    0.068362 -0.166527
foo one    0.236019 -0.833767
    three -0.420728 -1.598560
    two    3.180342 -1.421338
重塑(Reshaping)
堆叠(Stack)
tuples = list(zip(*[
['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two', 'one', 'two']
]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]

df2
Out[92]: 
                     A         B
first second                    
bar   one     1.259062  0.869558
      two    -0.743441  1.722838
baz   one     0.015719  1.503939
      two     0.570381  0.632574

stack()方法把 DataFrame 列压缩至一层:

stacked = df2.stack()

stacked
Out[94]: 
first  second   
bar    one     A    1.259062
               B    0.869558
       two     A   -0.743441
               B    1.722838
baz    one     A    0.015719
               B    1.503939
       two     A    0.570381
               B    0.632574
dtype: float64

压缩后的 DataFrame 或 Series 具有多层索引, stack() 的逆操作是 unstack(),默认为拆叠最后一层:

stacked.unstack()
Out[95]: 
                     A         B
first second                    
bar   one     1.259062  0.869558
      two    -0.743441  1.722838
baz   one     0.015719  1.503939
      two     0.570381  0.632574

stacked.unstack(1)
Out[96]: 
second        one       two
first                      
bar   A  1.259062 -0.743441
      B  0.869558  1.722838
baz   A  0.015719  0.570381
      B  1.503939  0.632574

stacked.unstack(0)
Out[97]: 
first          bar       baz
second                      
one    A  1.259062  0.015719
       B  0.869558  1.503939
two    A -0.743441  0.570381
       B  1.722838  0.632574
数据透视表(Pivot Tables)
df = pd.DataFrame({
'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)
})

df
Out[99]: 
        A  B    C         D         E
0     one  A  foo  0.056418 -0.830885
1     one  B  foo -0.364376  0.770207
2     two  C  foo -0.461330  0.294341
3   three  A  bar -1.237736  0.336037
4     one  B  bar -0.232306  0.165646
5     one  C  bar  0.547340 -1.031499
6     two  A  foo  0.006193  0.666089
7   three  B  foo  1.214052 -0.898943
8     one  C  foo  0.215837 -0.101430
9     one  A  bar  0.917909  0.721762
10    two  B  bar -0.359241  1.600898
11  three  C  bar  1.272114 -0.275911

用上述数据生成数据透视表非常简单:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[100]: 
C             bar       foo
A     B                    
one   A  0.917909  0.056418
      B -0.232306 -0.364376
      C  0.547340  0.215837
three A -1.237736       NaN
      B       NaN  1.214052
      C  1.272114       NaN
two   A       NaN  0.006193
      B -0.359241       NaN
      C       NaN -0.461330
时间序列(TimeSeries)

Pandas 为频率转换时重采样提供了虽然简单易用,但强大高效的功能,如,将秒级的数据转换为 5 分钟为频率的数据。这种操作常见于财务应用程序,但又不仅限于此。

rng = pd.date_range('1/1/2012', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample('5Min').sum()
Out[103]: 
2012-01-01    24434
Freq: 5T, dtype: int32

时区表示:

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

ts = pd.Series(np.random.randn(len(rng)), rng)

ts
Out[106]: 
2012-03-06   -0.732182
2012-03-07   -0.906948
2012-03-08   -1.093885
2012-03-09   -1.564909
2012-03-10   -0.630042
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')

ts_utc
Out[108]: 
2012-03-06 00:00:00+00:00   -0.732182
2012-03-07 00:00:00+00:00   -0.906948
2012-03-08 00:00:00+00:00   -1.093885
2012-03-09 00:00:00+00:00   -1.564909
2012-03-10 00:00:00+00:00   -0.630042
Freq: D, dtype: float64

转换成其它时区:

ts_utc.tz_convert('US/Eastern')
Out[109]: 
2012-03-05 19:00:00-05:00   -0.732182
2012-03-06 19:00:00-05:00   -0.906948
2012-03-07 19:00:00-05:00   -1.093885
2012-03-08 19:00:00-05:00   -1.564909
2012-03-09 19:00:00-05:00   -0.630042
Freq: D, dtype: float64

转换时间段:

rng = pd.date_range('1/1/2012', periods=5, freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts
Out[112]: 
2012-01-31   -1.345134
2012-02-29    0.214073
2012-03-31   -0.764937
2012-04-30   -1.346597
2012-05-31   -0.188443
Freq: M, dtype: float64

ps = ts.to_period()

ps
Out[114]: 
2012-01   -1.345134
2012-02    0.214073
2012-03   -0.764937
2012-04   -1.346597
2012-05   -0.188443
Freq: M, dtype: float64

ps.to_timestamp()
Out[115]: 
2012-01-01   -1.345134
2012-02-01    0.214073
2012-03-01   -0.764937
2012-04-01   -1.346597
2012-05-01   -0.188443
Freq: MS, dtype: float64

Pandas 函数可以很方便地转换时间段与时间戳。下例把以 11 月为结束年份的季度频率转换为下一季度月末上午 9 点:

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

ts = pd.Series(np.random.randn(len(prng)), prng)

ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

ts.head()
Out[119]: 
1990-03-01 09:00    0.071360
1990-06-01 09:00   -0.164712
1990-09-01 09:00    1.755744
1990-12-01 09:00    0.787537
1991-03-01 09:00    1.148147
Freq: H, dtype: float64
类别型(Categoricals)

Pandas 的 DataFrame 里可以包含类别数据。

 df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

grade 的原生数据转换为类别型数据:

df["grade"] = df["raw_grade"].astype("category")

df["grade"]
Out[122]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

用有含义的名字重命名不同类型,调用 Series.cat.categories

df["grade"].cat.categories = ["very good", "good", "very bad"]

重新排序各类别,并添加缺失类,Series.cat 的方法默认返回新 Series

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

df["grade"]
Out[125]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

这里是按生成类别时的顺序排序,不是按词汇排序:

df.sort_values(by="grade")
Out[126]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

按类列分组(groupby)时,即便某类别为空,也会显示:

df.groupby("grade").size()
Out[127]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
可视化
s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot()
Out[130]: <matplotlib.axes._subplots.AxesSubplot at 0x21756380250>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-EBQVWPoc-1605948551405)(C:\Users\YANGGEOL\AppData\Roaming\Typora\typora-user-images\image-20201121154100479.png)]

DataFrame 的 plot() 方法可以快速绘制所有带标签的列:

In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....: 

In [139]: df = df.cumsum()

In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>

In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b53a2d7f0>

In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f2b539728d0>

可视化2

数据输入 / 输出
CSV
In [143]: df.to_csv('foo.csv')

读取 CSV 文件数据:

In [144]: pd.read_csv('foo.csv')
Out[144]: 
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3    2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4    2000-01-05   0.578117   0.511371  0.103552  -2.428202
5    2000-01-06   0.478344   0.449933 -0.741620  -1.962409
6    2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
..          ...        ...        ...       ...        ...
993  2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
994  2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
HDF5

写入 HDF5 Store:

In [145]: df.to_hdf('foo.h5', 'df')

读取 HDF5 Store:

In [146]: pd.read_hdf('foo.h5', 'df')
Out[146]: 
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
Excel

写入 Excel 文件:

In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

读取 Excel 文件:

In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]: 
    Unnamed: 0          A          B         C          D
0   2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1   2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2   2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3   2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4   2000-01-05   0.578117   0.511371  0.103552  -2.428202
5   2000-01-06   0.478344   0.449933 -0.741620  -1.962409
6   2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
..         ...        ...        ...       ...        ...
993 2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
994 2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
995 2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996 2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997 2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998 2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
  • 1
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值