The examples in this article were verified with pandas 0.25.3, the latest release at the time of writing.
Install pandas with:
pip install pandas
If your installed version is older, upgrading is recommended:
python -m pip install --upgrade pandas
First import pandas. NumPy is usually used alongside it, so import it as well.
In [1]: import pandas as pd
In [2]: import numpy as np
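Since the article targets a specific release, it can help to check which version is actually installed. A minimal sketch, assuming a conventional version string such as 0.25.3; the names major and minor are illustrative only:

```python
import pandas as pd

# Print the installed pandas version; the article was verified against 0.25.3.
print(pd.__version__)

# Take the major and minor numbers for a simple comparison.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
print((major, minor) >= (0, 25))
```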
Creating objects
Create a Series
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Create a DataFrame from a NumPy N-dimensional array
In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
A B C D
2013-01-01 -0.343288 -0.127315 1.346011 -0.815653
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
2013-01-04 -0.980425 -0.069352 1.082163 0.507438
2013-01-05 0.837017 1.324691 -1.912240 -0.736096
2013-01-06 -0.620211 -1.000688 0.626714 -0.081108
Create a DataFrame from a dict
In [9]: df2 = pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20130102'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3]*4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Different columns of a DataFrame can have different dtypes
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
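In the dict-based constructor above, scalar values are repeated on every row, while the Series and the array determine the number of rows. A reduced sketch of the same idea; the variable name frame is illustrative only:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "A": 1.0,                                             # scalar, repeated on every row
    "C": pd.Series(1, index=range(3), dtype="float32"),   # supplies the row index
    "D": np.array([3] * 3, dtype="int32"),                # length must match the index
    "F": "foo",                                           # scalar string, repeated on every row
})
print(frame.shape)
print(frame.dtypes)
```

Each column keeps the dtype it was given, which is why dtypes reports a separate entry per column.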
In IPython, DataFrame attributes support Tab completion: type df2. and press Tab to list the available attributes and methods.
In [12]: df2.abs
Out[12]:
<bound method NDFrame.abs of A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo>
Viewing data
View the first and last rows. head() shows the first 5 rows by default.
In [13]: df.head()
Out[13]:
A B C D
2013-01-01 -0.343288 -0.127315 1.346011 -0.815653
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
2013-01-04 -0.980425 -0.069352 1.082163 0.507438
2013-01-05 0.837017 1.324691 -1.912240 -0.736096
In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 -0.980425 -0.069352 1.082163 0.507438
2013-01-05 0.837017 1.324691 -1.912240 -0.736096
2013-01-06 -0.620211 -1.000688 0.626714 -0.081108
View the index and the columns
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
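A NumPy array has a single dtype for the whole array, while a pandas DataFrame has one dtype per column. DataFrame.to_numpy() therefore picks one NumPy dtype that can hold every column; for columns of mixed types this ends up being object. The result does not include the index or the column labels.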
In [17]: df.to_numpy()
Out[17]:
array([[-0.34328818, -0.12731538, 1.34601055, -0.81565252],
[ 1.2659709 , -0.922331 , 0.30634462, -0.45983615],
[ 0.66631828, -0.54857206, -0.30105312, -0.09358928],
[-0.98042489, -0.06935154, 1.08216265, 0.50743768],
[ 0.83701695, 1.32469123, -1.91224007, -0.73609627],
[-0.62021063, -1.00068753, 0.62671423, -0.08110795]])
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
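The dtype behaviour of to_numpy() can be checked on two small frames. This is a sketch with made-up data; the names floats and mixed are illustrative only:

```python
import numpy as np
import pandas as pd

# All columns are float64, so the array stays float64.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])
arr = floats.to_numpy()
print(arr.dtype, arr.shape)

# A float column next to a string column: the common dtype is object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["p", "q"]})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)
```

In both cases only the values are returned; the row labels "a" and "b" and the column names are not part of the array.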
Get a quick statistical summary of the data
In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.137564 -0.223928 0.191323 -0.279807
std 0.905258 0.851818 1.182971 0.494298
min -0.980425 -1.000688 -1.912240 -0.815653
25% -0.550980 -0.828891 -0.149204 -0.667031
50% 0.161515 -0.337944 0.466529 -0.276713
75% 0.794342 -0.083843 0.968301 -0.084228
max 1.265971 1.324691 1.346011 0.507438
Transpose the data
In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A -0.343288 1.265971 0.666318 -0.980425 0.837017 -0.620211
B -0.127315 -0.922331 -0.548572 -0.069352 1.324691 -1.000688
C 1.346011 0.306345 -0.301053 1.082163 -1.912240 0.626714
D -0.815653 -0.459836 -0.093589 0.507438 -0.736096 -0.081108
Sort by an axis: axis=0 sorts by the row labels (the index), and axis=1 sorts by the column labels.
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -0.815653 1.346011 -0.127315 -0.343288
2013-01-02 -0.459836 0.306345 -0.922331 1.265971
2013-01-03 -0.093589 -0.301053 -0.548572 0.666318
2013-01-04 0.507438 1.082163 -0.069352 -0.980425
2013-01-05 -0.736096 -1.912240 1.324691 0.837017
2013-01-06 -0.081108 0.626714 -1.000688 -0.620211
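The difference between the two axes is easier to see on a frame whose rows and columns both start out unsorted. A small sketch with made-up data:

```python
import pandas as pd

small = pd.DataFrame({"B": [1, 2], "A": [3, 4]}, index=["y", "x"])

by_rows = small.sort_index(axis=0)   # reorders rows by index label
by_cols = small.sort_index(axis=1)   # reorders columns by column label
print(by_rows)
print(by_cols)
```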
Sort by values
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-06 -0.620211 -1.000688 0.626714 -0.081108
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
2013-01-01 -0.343288 -0.127315 1.346011 -0.815653
2013-01-04 -0.980425 -0.069352 1.082163 0.507438
2013-01-05 0.837017 1.324691 -1.912240 -0.736096
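sort_values also accepts a list of columns with a matching list of sort directions. A minimal sketch on made-up data, sorting by B ascending and breaking ties by C descending:

```python
import pandas as pd

frame = pd.DataFrame({"B": [2, 1, 2], "C": [0.5, 0.1, 0.9]})

# Primary key B ascending; rows with equal B are ordered by C descending.
ordered = frame.sort_values(by=["B", "C"], ascending=[True, False])
print(ordered)
```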
Selecting data
Select a single column, which returns a Series; equivalent to df.A
In [23]: df['A']
Out[23]:
2013-01-01 -0.343288
2013-01-02 1.265971
2013-01-03 0.666318
2013-01-04 -0.980425
2013-01-05 0.837017
2013-01-06 -0.620211
Freq: D, Name: A, dtype: float64
Select rows by slicing with [ ]
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 -0.343288 -0.127315 1.346011 -0.815653
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
2013-01-04 -0.980425 -0.069352 1.082163 0.507438
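Selection by label
- loc: selects data by label
- iloc: selects data by integer position
- ix: selects by label or by integer position (a mix of loc and iloc); deprecated since pandas 0.20, so prefer loc or iloc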
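The distinction between labels and positions matters most when the index holds integers that do not match the row order. A small sketch with made-up data:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
frame = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = frame.loc[0, "value"]    # the row labelled 0 is the second row
by_position = frame.iloc[0, 0]      # position 0 is the first row
print(by_label, by_position)
```

Choosing loc or iloc explicitly avoids the ambiguity that ix had in this situation.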
Get a row by its label
In [26]: df.loc[dates[0]]
Out[26]:
A -0.343288
B -0.127315
C 1.346011
D -0.815653
Name: 2013-01-01 00:00:00, dtype: float64
Select on multiple axes by label
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 -0.343288 -0.127315
2013-01-02 1.265971 -0.922331
2013-01-03 0.666318 -0.548572
2013-01-04 -0.980425 -0.069352
2013-01-05 0.837017 1.324691
2013-01-06 -0.620211 -1.000688
Slice by label on both axes; with label slices, both endpoints are included
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 1.265971 -0.922331
2013-01-03 0.666318 -0.548572
2013-01-04 -0.980425 -0.069352
Reduce the dimensionality of the result: selecting a single row label returns a Series
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A 1.265971
B -0.922331
Name: 2013-01-02 00:00:00, dtype: float64
Get a scalar value
In [30]: df.loc[dates[0], 'A']
Out[30]: -0.34328817932138245
Fast access to a scalar value, equivalent to the previous method
In [31]: df.at[dates[0], 'A']
Out[31]: -0.34328817932138245
Selection by position
Select a row by its integer position
In [32]: df.iloc[3]
Out[32]:
A -0.980425
B -0.069352
C 1.082163
D 0.507438
Name: 2013-01-04 00:00:00, dtype: float64
Select by integer slices; as in Python and NumPy, the stop position is excluded
In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 -0.980425 -0.069352
2013-01-05 0.837017 1.324691
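Label slices with loc and position slices with iloc treat their endpoints differently. A minimal sketch, using a small frame with made-up values:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

by_label = frame.loc["20130102":"20130104"]   # both endpoints are included
by_position = frame.iloc[1:3]                 # the stop position 3 is excluded
print(len(by_label), len(by_position))
```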
Select by lists of integer positions
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 1.265971 0.306345
2013-01-03 0.666318 -0.301053
2013-01-05 0.837017 -1.912240
Slice specific rows
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
Slice specific columns
In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 -0.127315 1.346011
2013-01-02 -0.922331 0.306345
2013-01-03 -0.548572 -0.301053
2013-01-04 -0.069352 1.082163
2013-01-05 1.324691 -1.912240
2013-01-06 -1.000688 0.626714
Get the element at a given position
In [37]: df.iloc[1, 1]
Out[37]: -0.9223310024162942
Fast access to the element at a given position, equivalent to the previous method
In [38]: df.iat[1, 1]
Out[38]: -0.9223310024162942
Boolean indexing
Use the values of a single column to select rows
In [39]: df[df.A > 0]
Out[39]:
A B C D
2013-01-02 1.265971 -0.922331 0.306345 -0.459836
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589
2013-01-05 0.837017 1.324691 -1.912240 -0.736096
Select the values of a DataFrame that meet a boolean condition; values that do not meet it become NaN
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 NaN NaN 1.346011 NaN
2013-01-02 1.265971 NaN 0.306345 NaN
2013-01-03 0.666318 NaN NaN NaN
2013-01-04 NaN NaN 1.082163 0.507438
2013-01-05 0.837017 1.324691 NaN NaN
2013-01-06 NaN NaN 0.626714 NaN
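Indexing a DataFrame with a boolean DataFrame behaves like DataFrame.where: entries where the condition is False are replaced by NaN and the shape is preserved. A small sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = frame[frame > 0]          # boolean-frame indexing
same = frame.where(frame > 0)      # explicit where() call
print(masked)
print(masked.equals(same))
```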
Filter data with the isin() method
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
A B C D E
2013-01-01 -0.343288 -0.127315 1.346011 -0.815653 one
2013-01-02 1.265971 -0.922331 0.306345 -0.459836 one
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589 two
2013-01-04 -0.980425 -0.069352 1.082163 0.507438 three
2013-01-05 0.837017 1.324691 -1.912240 -0.736096 four
2013-01-06 -0.620211 -1.000688 0.626714 -0.081108 three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
A B C D E
2013-01-03 0.666318 -0.548572 -0.301053 -0.093589 two
2013-01-05 0.837017 1.324691 -1.912240 -0.736096 four
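Because isin() returns a boolean Series, it can be inverted with ~ to keep the rows that are not in the list. A minimal sketch on made-up data; the name keep is illustrative only:

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "two", "three", "four"], "v": [1, 2, 3, 4]})

keep = frame["E"].isin(["two", "four"])   # boolean Series, one entry per row
print(frame[keep])                        # rows whose E is in the list
print(frame[~keep])                       # rows whose E is not in the list
```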
Setting values
Assigning a Series as a new column automatically aligns its data with the DataFrame's index
In [46]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102',periods=6))
In [47]: s1
Out[47]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [48]: df['F'] = s1
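The alignment can be seen directly on a smaller frame: labels missing from the Series produce NaN, and Series labels that are not in the frame's index are dropped. A sketch with made-up data; the name shifted is illustrative only:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
frame = pd.DataFrame({"A": [0.1, 0.2, 0.3]}, index=dates)

# The Series starts one day later than the frame and ends one day after it.
shifted = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))
frame["F"] = shifted
print(frame)
```

The first row has no matching label, so F is NaN there, and the column becomes float to hold it. This is why F shows 1.0 to 5.0 rather than integers in the result further down.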
Set a value by label
In [49]: df.at[dates[0], 'A'] = 0
Set a value by position
In [50]: df.iat[0, 1] = 0
Set a column by assigning a NumPy array
In [52]: df.loc[:, 'D'] = np.array([5] * len(df))
The result of the assignments above:
In [53]: df
Out[53]:
A B C D F
2013-01-01 0.000000 0.000000 1.346011 5 NaN
2013-01-02 1.265971 -0.922331 0.306345 5 1.0
2013-01-03 0.666318 -0.548572 -0.301053 5 2.0
2013-01-04 -0.980425 -0.069352 1.082163 5 3.0
2013-01-05 0.837017 1.324691 -1.912240 5 4.0
2013-01-06 -0.620211 -1.000688 0.626714 5 5.0
Conditional assignment (a where-style operation): set values wherever a boolean condition holds
In [54]: df2 = df.copy()
In [55]: df2[df2 > 0] = -df2
In [56]: df2
Out[56]:
A B C D F
2013-01-01 0.000000 0.000000 -1.346011 -5 NaN
2013-01-02 -1.265971 -0.922331 -0.306345 -5 -1.0
2013-01-03 -0.666318 -0.548572 -0.301053 -5 -2.0
2013-01-04 -0.980425 -0.069352 -1.082163 -5 -3.0
2013-01-05 -0.837017 -1.324691 -1.912240 -5 -4.0
2013-01-06 -0.620211 -1.000688 -0.626714 -5 -5.0
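The same result can be produced without modifying the frame in place by using DataFrame.mask, which replaces values where the condition is True. A minimal sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# In-place form, as in the session above.
in_place = frame.copy()
in_place[in_place > 0] = -in_place

# Equivalent form that returns a new DataFrame.
masked = frame.mask(frame > 0, -frame)
print(masked)
print(in_place.equals(masked))
```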
To be continued