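This section introduces the basic functionality for working with the data in Series and DataFrame objects.
Reindexing
An important method on pandas objects is reindex, which creates a new object whose data conforms to a new index. Consider a simple example from earlier: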
In [1]: from pandas import Series,DataFrame
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: obj=Series([6.5,7.8,-5.9,8.6],index=['d','b','a','c'])
In [5]: obj
Out[5]:
d 6.5
b 7.8
a -5.9
c 8.6
dtype: float64
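Calling reindex on this Series rearranges the data according to the new index. If an index value is not already present, a missing value (NaN) is introduced for it: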
In [6]: obj2=obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [7]: obj2
Out[7]:
a -5.9
b 7.8
c 8.6
d 6.5
e NaN
dtype: float64
The fill_value option supplies a value to use in place of NaN for labels that were missing:
In [8]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[8]:
a -5.9
b 7.8
c 8.6
d 6.5
e 0.0
dtype: float64
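The behaviour above can be checked with a minimal sketch, assuming a reasonably recent pandas release. It shows that reindex returns a new Series and leaves the original untouched, and that fill_value replaces the default NaN. The names obj2 and obj_filled are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.Series([6.5, 7.8, -5.9, 8.6], index=['d', 'b', 'a', 'c'])

# reindex builds a new object; labels missing from obj become NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value substitutes a constant for the labels that were missing
obj_filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

print(obj2.isna().sum())   # how many labels were introduced as NaN
print(list(obj.index))     # the original Series keeps its own order
```

Because the original object is not modified, the result has to be assigned to a name (as with obj2) if it is needed later.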
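For ordered data such as time series, it can be useful to fill in values when reindexing. The method option does this; ffill carries the last known value forward:
In [9]: obj3=Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])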
In [10]: obj3.reindex(range(6), method='ffill')
Out[10]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
bfill fills backward instead. Label 5 has no later value to borrow from, so it stays NaN:
In [11]: obj3.reindex(range(6), method='bfill')
Out[11]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
pad is an alias for ffill and gives the same result:
In [12]: obj3.reindex(range(6), method='pad')
Out[12]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
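As a compact comparison of the fill methods shown above, the following sketch (assuming a recent pandas release; the names forward and backward are illustrative) runs both directions on the same Series.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill (alias: pad) carries the last seen value forward
forward = obj3.reindex(range(6), method='ffill')

# bfill (alias: backfill) pulls the next value backward;
# label 5 has no later value, so it remains missing
backward = obj3.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.isna().tolist())
```

Note that the method option requires the existing index to be sorted (monotonic), which is the case for [0, 2, 4] here.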
For a DataFrame, reindex can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows:
In [13]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
In [14]: frame
Out[14]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [15]: frame2=frame.reindex(['a', 'b', 'c', 'd'])
In [16]: frame2
Out[16]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
The columns can be reindexed with the columns keyword:
In [17]: states = ['Texas', 'Utah', 'California']
In [18]: frame.reindex(columns=states)
Out[18]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
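The difference between the two outputs above (floats after reindexing rows, integers after reindexing columns) can be seen in a short sketch. It assumes a recent pandas release, and the names by_rows and by_cols are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# A new row 'b' puts NaN into every column, so integer columns become float
by_rows = frame.reindex(['a', 'b', 'c', 'd'])

# A new column 'Utah' is all NaN, but the existing columns keep their dtype;
# 'Ohio' is dropped because it is not in the requested list
by_cols = frame.reindex(columns=['Texas', 'Utah', 'California'])

print(by_rows.dtypes)
print(by_cols.dtypes)
```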
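Rows and columns can also be reindexed in a single step. Older pandas versions allowed this through label indexing with ix, but ix was deprecated in pandas 0.20 and removed in 1.0. Passing both index and columns to reindex gives the same result:
In [28]: frame
Out[28]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [31]: states = ['Texas', 'Utah', 'California']
In [32]: frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)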
Out[32]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
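Since ix is no longer available in pandas 1.0 and later, the sketch below reproduces the same selection with reindex. It also shows, under the assumption of pandas 1.0 or later, that loc is not a drop-in replacement here, because it raises KeyError when a requested label is missing. The names both and loc_failed are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Reindex rows and columns together; missing labels are filled with NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(both)

# Assumption: pandas >= 1.0, where .loc refuses lists with missing labels
try:
    frame.loc[['a', 'b', 'c', 'd'], states]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)
```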
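Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list of the labels to remove. Because this involves a bit of data munging and set logic, the drop method does not modify the object in place; it returns a new object with the indicated values deleted from the given axis: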
In [33]: obj=Series(np.arange(5.),index=['a', 'b', 'c', 'd', 'e'])
In [34]: obj
Out[34]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [35]: new_obj=obj.drop('c')
In [36]: new_obj
Out[36]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [37]: obj.drop(['d','b'])
Out[37]:
a 0.0
c 2.0
e 4.0
dtype: float64
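A minimal sketch of the point above, assuming a recent pandas release: drop returns a new Series and the original keeps all of its entries. The errors='ignore' argument, which is not used in the session above, skips labels that are not present instead of raising KeyError.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop returns a new object; obj itself still contains 'c'
new_obj = obj.drop('c')

# By default a missing label raises KeyError; errors='ignore' skips it
lenient = obj.drop(['c', 'z'], errors='ignore')

print(list(new_obj.index))
print(len(obj))
```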
For a DataFrame, index values can be deleted from either axis. Rows are dropped by default; pass axis=1 to drop columns:
In [41]: data = DataFrame(np.arange(16).reshape((4, 4)),
...: index=['Ohio', 'Colorado', 'Utah', 'New York'],
...: columns=['one', 'two', 'three', 'four'])
In [42]: data
Out[42]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [43]: data.drop(['Colorado', 'Ohio'])
Out[43]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [44]: data.drop('two',axis=1)
Out[44]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [45]: data.drop(['two', 'four'], axis=1)
Out[45]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
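The axis argument used above can also be written with the index and columns keywords, which some readers find easier to scan. The sketch below assumes pandas 0.21 or later, where these keywords are available; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Rows are dropped by default (axis=0)
rows_dropped = data.drop(['Colorado', 'Ohio'])

# Two equivalent ways to drop columns
cols_axis = data.drop(['two', 'four'], axis=1)
cols_kw = data.drop(columns=['two', 'four'])

print(cols_axis.equals(cols_kw))
print(data.shape)  # the original frame is unchanged
```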
Indexing, selection, and filtering
The index values used to select from a Series are not limited to integers:
In [47]: obj=Series(np.arange(5.),index=[