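II. Basic Functionality
1. Reindexing
reindex is an important method on pandas objects. It creates a new object whose data conforms to a new index; labels that were not present in the original index get missing values (NaN):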
import numpy as np
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method argument does this; for example, ffill forward-fills the values:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
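As a complement to the forward fill above, here is a minimal sketch of backward filling with method='bfill'. The variable names (colors, backfilled) are chosen for illustration only and are not from the original notes.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the value of the next existing label.
backfilled = colors.reindex(range(6), method='bfill')

# Label 5 has no later label to borrow from, so it stays missing.
print(backfilled)
```

Note that the method argument requires the original index to be monotonically increasing or decreasing.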
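With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows; the columns keyword reindexes the columns: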
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
frame
| Ohio | Texas | California |
---|
a | 0 | 1 | 2 |
---|
c | 3 | 4 | 5 |
---|
d | 6 | 7 | 8 |
---|
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
| Ohio | Texas | California |
---|
a | 0.0 | 1.0 | 2.0 |
---|
b | NaN | NaN | NaN |
---|
c | 3.0 | 4.0 | 5.0 |
---|
d | 6.0 | 7.0 | 8.0 |
---|
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
| Texas | Utah | California |
---|
a | 1 | NaN | 2 |
---|
c | 4 | NaN | 5 |
---|
d | 7 | NaN | 8 |
---|
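Rows and columns can also be reindexed in a single call. Older pandas versions allowed the more concise frame.loc[['a', 'b', 'c', 'd'], states] for this, but since pandas 1.0 loc raises a KeyError when any requested label is missing, so reindex is the safe choice here:
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)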
| Texas | Utah | California |
---|
a | 1.0 | NaN | 2.0 |
---|
b | NaN | NaN | NaN |
---|
c | 4.0 | NaN | 5.0 |
---|
d | 7.0 | NaN | 8.0 |
---|
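Arguments to reindex:
Argument | Description |
---|
index | New sequence to use as the index |
method | Fill method: ffill fills forward, bfill fills backward |
fill_value | Value to substitute for missing data introduced by reindexing |
limit | When forward- or backfilling, the maximum gap size (in number of elements) to fill |
tolerance | When forward- or backfilling, the maximum gap size (in absolute numeric distance) to fill for inexact matches |
level | Match a simple Index on a level of a MultiIndex |
copy | If True, copy the underlying data even if the new index is equivalent to the old one |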
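To make the fill_value argument concrete, the following sketch reindexes rows and columns in one call and substitutes 0 for every newly introduced cell. The frame is rebuilt so the snippet runs on its own; the name filled is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex both axes at once; new cells get 0 instead of NaN.
filled = frame.reindex(index=['a', 'b', 'c', 'd'],
                       columns=['Texas', 'Utah', 'California'],
                       fill_value=0)
print(filled)
```

Because no NaN is introduced, the integer columns do not need to be converted to floating point.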
2. Dropping Entries from an Axis
The drop method returns a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d', 'c'])
a 0.0
b 1.0
e 4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
| one | two | three | four |
---|
Ohio | 0 | 1 | 2 | 3 |
---|
Colorado | 4 | 5 | 6 | 7 |
---|
Utah | 8 | 9 | 10 | 11 |
---|
New York | 12 | 13 | 14 | 15 |
---|
Calling drop with a sequence of labels drops values from the row labels (axis 0):
data.drop(['Colorado', 'Ohio'])
| one | two | three | four |
---|
Utah | 8 | 9 | 10 | 11 |
---|
New York | 12 | 13 | 14 | 15 |
---|
You can drop values from the columns by passing axis=1 or axis='columns':
data.drop('two', axis=1)
| one | three | four |
---|
Ohio | 0 | 2 | 3 |
---|
Colorado | 4 | 6 | 7 |
---|
Utah | 8 | 10 | 11 |
---|
New York | 12 | 14 | 15 |
---|
data.drop(['two', 'four'], axis='columns')
| one | three |
---|
Ohio | 0 | 2 |
---|
Colorado | 4 | 6 |
---|
Utah | 8 | 10 |
---|
New York | 12 | 14 |
---|
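As an alternative to the axis argument, drop also accepts index and columns keywords, which lets one call remove labels from both axes. A small sketch, with data rebuilt so it is self-contained (trimmed is an illustrative name):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Remove two rows and one column in a single call.
trimmed = data.drop(index=['Colorado', 'Ohio'], columns=['two'])
print(trimmed)
```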
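Many functions, such as drop, that modify the size or shape of a Series or DataFrame can operate on the object in place, without returning a new object, when passed inplace=True. Be careful with this option, because the dropped data is destroyed: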
obj.drop('c', inplace=True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
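The difference between the default behaviour and inplace=True can be checked directly. This is a minimal sketch; the names original, trimmed, scratch and returned are made up for the illustration.

```python
import numpy as np
import pandas as pd

original = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new Series is returned and `original` is left untouched.
trimmed = original.drop('c')

# inplace=True: the call returns None and mutates the object itself.
scratch = original.copy()
returned = scratch.drop('c', inplace=True)

print('c' in original.index, 'c' in trimmed.index, returned)
```

Reassigning the result (obj = obj.drop('c')) is a common way to get the same effect without inplace=True.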
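3. Indexing, Selection, and Filtering
Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the index labels of the Series instead of only integers. Slicing with labels also differs from ordinary Python slicing: the endpoint is included: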
obj['c'] = 2.0
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
a 0.0
b 5.0
c 5.0
d 3.0
e 4.0
dtype: float64
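The inclusive endpoint is easiest to see side by side with a positional slice. A short sketch under the same setup as above (by_label and by_position are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Label slice: both 'b' and 'c' are included.
by_label = obj['b':'c']

# Positional slice: position 2 ('c') is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(len(by_label), len(by_position))
```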
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
| one | two | three | four |
---|
Ohio | 0 | 1 | 2 | 3 |
---|
Colorado | 4 | 5 | 6 | 7 |
---|
Utah | 8 | 9 | 10 | 11 |
---|
New York | 12 | 13 | 14 | 15 |
---|
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
data[['three', 'one']]
| three | one |
---|
Ohio | 2 | 0 |
---|
Colorado | 6 | 4 |
---|
Utah | 10 | 8 |
---|
New York | 14 | 12 |
---|
Indexing like this has a few special cases. First, you can select rows by slicing or with a Boolean array:
data[:2]
| one | two | three | four |
---|
Ohio | 0 | 1 | 2 | 3 |
---|
Colorado | 4 | 5 | 6 | 7 |
---|
data[data['three'] > 5]
| one | two | three | four |
---|
Colorado | 4 | 5 | 6 | 7 |
---|
Utah | 8 | 9 | 10 | 11 |
---|
New York | 12 | 13 | 14 | 15 |
---|
data < 5
| one | two | three | four |
---|
Ohio | True | True | True | True |
---|
Colorado | True | False | False | False |
---|
Utah | False | False | False | False |
---|
New York | False | False | False | False |
---|
data[data < 5] = 0
data
| one | two | three | four |
---|
Ohio | 0 | 0 | 0 | 0 |
---|
Colorado | 0 | 5 | 6 | 7 |
---|
Utah | 8 | 9 | 10 | 11 |
---|
New York | 12 | 13 | 14 | 15 |
---|
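Boolean conditions can also be combined. The sketch below is an illustration rather than part of the original notes; it uses & with parentheses around each comparison, which is needed because & binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Keep rows where 'three' exceeds 5 AND 'one' is below 12.
selected = data[(data['three'] > 5) & (data['one'] < 12)]
print(selected)
```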
Selecting data with loc and iloc. loc selects by axis labels and iloc by integer position:
data.loc['Colorado', ['two', 'three']]
two 5
three 6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
| four | one | two |
---|
Colorado | 7 | 0 | 5 |
---|
Utah | 11 | 8 | 9 |
---|
data.loc[:'Utah', 'two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
| one | two | three |
---|
Colorado | 0 | 5 | 6 |
---|
Utah | 8 | 9 | 10 |
---|
New York | 12 | 13 | 14 |
---|
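The chained selection data.iloc[:, :3][data.three > 5] can also be written as a single loc call that takes the Boolean mask for the rows and a label list for the columns. A sketch, assuming a freshly built data frame (so the values differ from the modified frame above):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Chained form: slice columns by position, then filter rows with a mask.
chained = data.iloc[:, :3][data.three > 5]

# Single-step form: row mask and column labels in one loc call.
single = data.loc[data['three'] > 5, ['one', 'two', 'three']]

print(single)
```

The single-step form avoids an intermediate object and is also the form to prefer when assigning values to the selection.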
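With an integer index, pandas does not support negative positional indexing through [], because it cannot tell whether -1 is meant as a label or as a position. With a non-integer index such as ser2 further down there is no such ambiguity, although recent pandas releases deprecate this positional fallback in favor of iloc: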
ser = pd.Series(np.arange(3.))
ser
0 0.0
1 1.0
2 2.0
dtype: float64
ser[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-67-7ed1c232b3a2> in <module>()
----> 1 ser[-1]
~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2475 try:
2476 return self._engine.get_value(s, k,
-> 2477 tz=getattr(series.dtype, 'tz', None))
2478 except KeyError as e1:
2479 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14031)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13975)()
KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
a 0.0
b 1.0
c 2.0
dtype: float64
ser2[-1]
2.0
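To stay unambiguous regardless of the index type, positions can always be addressed through iloc and labels through loc. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# iloc is always positional, so -1 means "last element".
last_value = ser.iloc[-1]

# loc is always label-based: the slice end is a label and is included.
by_label = ser.loc[:1]

# iloc slicing follows Python rules: the end position is excluded.
by_position = ser.iloc[:1]

print(last_value, len(by_label), len(by_position))
```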
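Arithmetic methods with fill values. Adding objects with different indexes yields NaN wherever the labels do not overlap; the fill_value argument of the arithmetic methods substitutes a value for the missing side: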
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
| a | b | c | d |
---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
---|
1 | 4.0 | 5.0 | 6.0 | 7.0 |
---|
2 | 8.0 | 9.0 | 10.0 | 11.0 |
---|
df2
| a | b | c | d | e |
---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
---|
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
---|
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
---|
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
---|
df1 + df2
| a | b | c | d | e |
---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
---|
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
---|
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
---|
3 | NaN | NaN | NaN | NaN | NaN |
---|
df1.add(df2, fill_value=0)
| a | b | c | d | e |
---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
---|
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
---|
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
---|
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
---|
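A related use of fill_value is in reindex itself, for example to give df1 the columns of df2 before further arithmetic. This is a sketch under the same df1/df2 definitions as above; aligned is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Give df1 the same columns as df2; the new column 'e' is filled with 0.
aligned = df1.reindex(columns=df2.columns, fill_value=0)
print(aligned)
```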
1 / df1
| a | b | c | d |
---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
---|
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
---|
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
---|
df1.rdiv(1)
| a | b | c | d |
---|
0 | inf | 1.000000 | 0.500000 | 0.333333 |
---|
1 | 0.250000 | 0.200000 | 0.166667 | 0.142857 |
---|
2 | 0.125000 | 0.111111 | 0.100000 | 0.090909 |
---|
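Flexible arithmetic methods:
Method | Description |
---|
add, radd | Addition (+) |
sub, rsub | Subtraction (-) |
div, rdiv | Division (/) |
floordiv, rfloordiv | Floor division (//) |
mul, rmul | Multiplication (*) |
pow, rpow | Exponentiation (**) |
Operations between DataFrame and Series: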
arr = np.arange(12.).reshape((3, 4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([ 0., 1., 2., 3.])
arr - arr[0]
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
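This is called broadcasting. Operations between a DataFrame and a Series work similarly: by default the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows: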
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
| b | d | e |
---|
Utah | 0.0 | 1.0 | 2.0 |
---|
Ohio | 3.0 | 4.0 | 5.0 |
---|
Texas | 6.0 | 7.0 | 8.0 |
---|
Oregon | 9.0 | 10.0 | 11.0 |
---|
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
| b | d | e |
---|
Utah | 0.0 | 0.0 | 0.0 |
---|
Ohio | 3.0 | 3.0 | 3.0 |
---|
Texas | 6.0 | 6.0 | 6.0 |
---|
Oregon | 9.0 | 9.0 | 9.0 |
---|
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
| b | d | e | f |
---|
Utah | 0.0 | NaN | 3.0 | NaN |
---|
Ohio | 3.0 | NaN | 6.0 | NaN |
---|
Texas | 6.0 | NaN | 9.0 | NaN |
---|
Oregon | 9.0 | NaN | 12.0 | NaN |
---|
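To broadcast over the columns instead, matching on the rows, one of the flexible arithmetic methods can be used with axis='index'. A minimal sketch with the same frame (column_d and result are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A Series indexed by the row labels.
column_d = frame['d']

# Match on the row index and subtract column 'd' from every column.
result = frame.sub(column_d, axis='index')
print(result)
```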
4. Function Application and Mapping
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
| b | d | e |
---|
Utah | -0.894119 | -1.707091 | 0.843595 |
---|
Ohio | -0.413837 | -0.251321 | 0.440044 |
---|
Texas | 0.871326 | 0.828606 | -0.521924 |
---|
Oregon | 0.603869 | 0.154679 | 2.067279 |
---|
np.abs(frame)
| b | d | e |
---|
Utah | 0.894119 | 1.707091 | 0.843595 |
---|
Ohio | 0.413837 | 0.251321 | 0.440044 |
---|
Texas | 0.871326 | 0.828606 | 0.521924 |
---|
Oregon | 0.603869 | 0.154679 | 2.067279 |
---|
f = lambda x: x.max() - x.min()
frame.apply(f)
b 1.765445
d 2.535697
e 2.589203
dtype: float64
frame.apply(f, axis='columns')
Utah 2.550685
Ohio 0.853881
Texas 1.393250
Oregon 1.912600
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
| b | d | e |
---|
min | -0.894119 | -1.707091 | -0.521924 |
---|
max | 0.871326 | 0.828606 | 2.067279 |
---|
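Many common reductions already exist as DataFrame methods, so apply is often not needed. The sketch below checks that the built-in max and min give the same range as the apply call; it uses a seeded random generator, so the numbers differ from the output shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

value_range = lambda x: x.max() - x.min()

# Range per column, computed via apply and via built-in reductions.
via_apply = frame.apply(value_range)
via_builtin = frame.max() - frame.min()

# Range per row, the same comparison along the other axis.
rows_apply = frame.apply(value_range, axis='columns')
rows_builtin = frame.max(axis='columns') - frame.min(axis='columns')

print(np.allclose(via_apply, via_builtin),
      np.allclose(rows_apply, rows_builtin))
```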
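Element-wise Python functions can be used as well. Suppose you want to compute a formatted string from each floating-point value in frame; the applymap method does this (pandas 2.1 deprecates applymap in favor of DataFrame.map):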
format = lambda x: '%.2f' % x
frame.applymap(format)
| b | d | e |
---|
Utah | -0.89 | -1.71 | 0.84 |
---|
Ohio | -0.41 | -0.25 | 0.44 |
---|
Texas | 0.87 | 0.83 | -0.52 |
---|
Oregon | 0.60 | 0.15 | 2.07 |
---|
Series has its own map method for applying an element-wise function:
frame['e'].map(format)
Utah 0.84
Ohio 0.44
Texas -0.52
Oregon 2.07
Name: e, dtype: object
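Since the element-wise DataFrame method is named map from pandas 2.1 onward and applymap before that, a version-tolerant sketch could pick whichever exists. The helper to_two_decimals and the small frame are hypothetical, for illustration only.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.234, -0.5], 'e': [2.0, 3.14159]})

def to_two_decimals(x):
    """Format a number with two digits after the decimal point."""
    return f'{x:.2f}'

# Use DataFrame.map when available (pandas >= 2.1), else fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(to_two_decimals)

# Series.map has the same name in both old and new versions.
formatted_e = frame['e'].map(to_two_decimals)

print(formatted)
```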