1. Reindexing
An important method on pandas objects is reindex, which creates a new object that conforms to a new index.
In [1]: import numpy as np; import pandas as pd
In [2]: from pandas import Series, DataFrame
In [3]: obj = Series([1,2,3,4],index=['d','c','b','a'])
In [4]: obj
Out[4]:
d 1
c 2
b 3
a 4
dtype: int64
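Calling reindex on this Series rearranges the data according to the new index. If an index value is not currently present, a missing value is introduced.
In [7]: obj2 = obj.reindex(['a','b','c','d','e'])
In [8]: obj2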
Out[8]:
a 4.0
b 3.0
c 2.0
d 1.0
e NaN
dtype: float64
The fill_value option supplies the value to use in place of the missing values:
In [9]: obj2 = obj.reindex(['a','b','c','d','e'],fill_value=0)
In [10]: obj2
Out[10]:
a 4
b 3
c 2
d 1
e 0
dtype: int64
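The behaviour above can be checked in a standalone script. This is a minimal sketch that reuses the Series from the session; the names with_nan and with_zero are illustrative and do not appear in the original.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=["d", "c", "b", "a"])

# reindex returns a new object; the original keeps its own order.
with_nan = obj.reindex(["a", "b", "c", "d", "e"])
with_zero = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

print(with_nan.dtype)   # float, because NaN cannot be stored in an integer Series
print(with_zero.dtype)  # the integer dtype survives when no NaN is introduced
print(list(obj.index))  # the original index is unchanged
```

The change from int64 to float64 in the first result is a side effect of introducing NaN, not of reindexing itself.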
For ordered data such as time series, you may want to interpolate or fill values when reindexing; the method option does this.
In [11]: obj3 = Series(['beijing','shanghai','guangzhou'],index=[0,2,4])
In [12]: obj3.reindex(range(6),method='ffill')
Out[12]:
0 beijing
1 beijing
2 shanghai
3 shanghai
4 guangzhou
5 guangzhou
dtype: object
Values accepted by the method option of reindex:
- ffill or pad: fill forward (carry values forward)
- bfill or backfill: fill backward (carry values backward)
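To see the two directions side by side, the sketch below applies both options to the same Series as in the session. The variable names forward and backward are illustrative.

```python
import pandas as pd

obj3 = pd.Series(["beijing", "shanghai", "guangzhou"], index=[0, 2, 4])

# ffill carries the last known value forward into the gaps.
forward = obj3.reindex(range(6), method="ffill")
# bfill pulls the next known value backward into the gaps.
backward = obj3.reindex(range(6), method="bfill")

print(pd.DataFrame({"ffill": forward, "bfill": backward}))
```

With bfill, label 5 stays missing because there is no later value to pull back.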
For a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows:
In [14]: frame = DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
...: columns=['xian','beijing','chongqing'])
In [15]: frame
Out[15]:
xian beijing chongqing
a 0 1 2
c 3 4 5
d 6 7 8
In [16]: frame2 = frame.reindex(['a','b','c','d'])
In [17]: frame2
Out[17]:
xian beijing chongqing
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [20]: city = ['xian','beijing','shenzhen']
# Use the columns keyword to reindex the columns
In [21]: frame.reindex(columns=city)
Out[21]:
xian beijing shenzhen
a 0 1 NaN
c 3 4 NaN
d 6 7 NaN
Rows and columns can also be reindexed together, but filling is only applied row-wise (that is, along axis 0):
In [25]: frame
Out[25]:
xian beijing chongqing
a 0 1 2
c 3 4 5
d 6 7 8
In [26]: frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=city)
Out[26]:
xian beijing shenzhen
a 0 1 NaN
b 0 1 NaN
c 3 4 NaN
d 6 7 NaN
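With the label indexing of ix, the reindexing task can be written more concisely (note that ix was deprecated in pandas 0.20 and removed in 1.0, so this form only runs on older versions):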
In [27]: frame.ix[['a','b','c','d'],city]
Out[27]:
xian beijing shenzhen
a 0.0 1.0 NaN
b NaN NaN NaN
c 3.0 4.0 NaN
d 6.0 7.0 NaN
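On pandas 1.0 and later, where ix is no longer available and loc raises a KeyError when a list contains labels that do not exist, the same result can be obtained with a single reindex call. A minimal sketch, reusing the frame and city list from the session (the name result is illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["xian", "beijing", "chongqing"],
)
city = ["xian", "beijing", "shenzhen"]

# Reindex rows and columns in one call; labels that do not exist become NaN.
result = frame.reindex(index=["a", "b", "c", "d"], columns=city)
print(result)
```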
2. Dropping entries from an axis
2.1 Series
Use the drop method, which returns a new object with the specified values removed from the given axis:
In [28]: obj = Series(np.arange(5),index=['a','b','c','d','e'])
In [29]: new_obj = obj.drop('c')
In [30]: new_obj
Out[30]:
a 0
b 1
d 3
e 4
dtype: int32
In [31]: obj.drop(['d','c'])
Out[31]:
a 0
b 1
e 4
dtype: int32
In [32]: obj
Out[32]:
a 0
b 1
c 2
d 3
e 4
dtype: int32
2.2 DataFrame
For a DataFrame, index values can be dropped from either axis:
In [33]: data = DataFrame(np.arange(16).reshape((4,4)),
...: index=['xian','shenzhen','guangzhou','wuhan'],
...: columns=['a','b','c','d'])
In [34]: data
Out[34]:
a b c d
xian 0 1 2 3
shenzhen 4 5 6 7
guangzhou 8 9 10 11
wuhan 12 13 14 15
In [35]: data.drop(['xian','shenzhen'])
Out[35]:
a b c d
guangzhou 8 9 10 11
wuhan 12 13 14 15
# axis defaults to 0 (the index); to drop columns, axis must be 1
In [36]: data.drop('b',axis=1)
Out[36]:
a c d
xian 0 2 3
shenzhen 4 6 7
guangzhou 8 10 11
wuhan 12 14 15
In [37]: data.drop(['c','d'],axis=1)
Out[37]:
a b
xian 0 1
shenzhen 4 5
guangzhou 8 9
wuhan 12 13
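Recent pandas versions also accept index= and columns= keywords on drop, which some readers find clearer than axis numbers. The sketch below assumes such a version; rows_dropped and cols_dropped are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["xian", "shenzhen", "guangzhou", "wuhan"],
    columns=["a", "b", "c", "d"],
)

rows_dropped = data.drop(index=["xian", "shenzhen"])  # same as data.drop([...], axis=0)
cols_dropped = data.drop(columns=["c", "d"])          # same as data.drop([...], axis=1)

print(rows_dropped)
print(cols_dropped)
print(data.shape)  # drop returns a new object, so the original is untouched
```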
2.3 Indexing, selection, and filtering
- Series
In [42]: obj = Series(np.arange(4),index=['a','b','c','d'])
In [43]: obj
Out[43]:
a 0
b 1
c 2
d 3
dtype: int32
In [44]: obj['b']
Out[44]: 1
In [45]: obj[1]
Out[45]: 1
In [46]: obj[2:4]
Out[46]:
c 2
d 3
dtype: int32
In [48]: obj[['b','a','d']]
Out[48]:
b 1
a 0
d 3
dtype: int32
In [49]: obj[[1,3]]
Out[49]:
b 1
d 3
dtype: int32
In [50]: obj[obj<2]
Out[50]:
a 0
b 1
dtype: int32
# Slicing with labels includes the endpoint
In [51]: obj['b':'d']
Out[51]:
b 1
c 2
d 3
dtype: int32
Assigning through a label slice works the same way:
In [52]: obj['b':'d'] = 100
In [53]: obj
Out[53]:
a 0
b 100
c 100
d 100
dtype: int32
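The difference between label slices and positional slices is easy to mix up. The sketch below makes the intent explicit with loc (labels) and iloc (positions) instead of plain brackets; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"d"]  # label slice: both endpoints are included
by_position = obj.iloc[1:3]  # positional slice: the end is excluded, as in a Python list

print(by_label)
print(by_position)
```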
- DataFrame
Indexing into a DataFrame retrieves one or more columns:
In [54]: data = DataFrame(np.arange(16).reshape((4,4)),
...: index=['xian','shenzhen','guangzhou','wuhan'],
...: columns=['a','b','c','d'])
In [55]: data
Out[55]:
a b c d
xian 0 1 2 3
shenzhen 4 5 6 7
guangzhou 8 9 10 11
wuhan 12 13 14 15
In [56]: data['b']
Out[56]:
xian 1
shenzhen 5
guangzhou 9
wuhan 13
Name: b, dtype: int32
In [57]: data[['c','d']]
Out[57]:
c d
xian 2 3
shenzhen 6 7
guangzhou 10 11
wuhan 14 15
In [59]: data[:2]
Out[59]:
a b c d
xian 0 1 2 3
shenzhen 4 5 6 7
In [60]: data[data['c']>5]
Out[60]:
a b c d
shenzhen 4 5 6 7
guangzhou 8 9 10 11
wuhan 12 13 14 15
Indexing with a boolean DataFrame:
In [61]: data < 5
Out[61]:
a b c d
xian True True True True
shenzhen True False False False
guangzhou False False False False
wuhan False False False False
In [62]: data[data<5] = 0
In [63]: data
Out[63]:
a b c d
xian 0 0 0 0
shenzhen 0 5 6 7
guangzhou 8 9 10 11
wuhan 12 13 14 15
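For label indexing on the rows of a DataFrame, pandas introduced the special indexing field ix, which selects a subset of rows and columns using NumPy-style notation plus axis labels. ix was removed in pandas 1.0; loc (label-based) and iloc (position-based) now cover its use cases.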
In [65]: data.ix['shenzhen',['b','c']]
Out[65]:
b 5
c 6
Name: shenzhen, dtype: int32
# [3,0,1] selects columns 3, 0, and 1 of data by position
In [67]: data.ix[['shenzhen','wuhan'],[3,0,1]]
Out[67]:
d a b
shenzhen 7 0 5
wuhan 15 12 13
In [68]: data.ix[2]
Out[68]:
a 8
b 9
c 10
d 11
Name: guangzhou, dtype: int32
In [69]: data.ix[:'shenzhen','b']
Out[69]:
xian 0
shenzhen 5
Name: b, dtype: int32
In [70]: data.ix[data.c > 5, :3]
Out[70]:
a b c
shenzhen 0 5 6
guangzhou 8 9 10
wuhan 12 13 14
[Figure: DataFrame indexing options]
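For readers on a pandas version without ix, the selections from the session above can be rewritten with loc and iloc. This is one possible translation, not the only one; every variable name on the left-hand side is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["xian", "shenzhen", "guangzhou", "wuhan"],
    columns=["a", "b", "c", "d"],
)
data[data < 5] = 0  # reproduce the state of data at that point in the session

row_by_label = data.loc["shenzhen", ["b", "c"]]  # labels on both axes
row_by_position = data.iloc[2]                   # third row, by position
label_slice = data.loc[:"shenzhen", "b"]         # label slices include the end
# ix allowed labels and positions to be mixed; with loc, convert positions to labels first.
mixed = data.loc[["shenzhen", "wuhan"], data.columns[[3, 0, 1]]]
filtered = data.loc[data["c"] > 5, data.columns[:3]]  # boolean mask plus first three columns

print(mixed)
print(filtered)
```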
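2.4 Arithmetic and data alignment
One of the most important pandas features is arithmetic between objects with different indexes. When objects are added and their index pairs are not the same, the index of the result is the union of the index pairs:
- Series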
In [71]: s1 = Series([1,2,3,4],index=['a','b','c','d'])
In [72]: s2 = Series([5,6,7,8],index=['a','b','c','e'])
In [73]: s1
Out[73]:
a 1
b 2
c 3
d 4
dtype: int64
In [74]: s2
Out[74]:
a 5
b 6
c 7
e 8
dtype: int64
In [75]: s1 + s2
Out[75]:
a 6.0
b 8.0
c 10.0
d NaN
e NaN
dtype: float64
- DataFrame
For a DataFrame, alignment happens on both the rows and the columns:
In [76]: df1 = DataFrame(np.arange(9).reshape((3,3)),columns=list('bcd'),
...: index=['xian','shenzhen','beijing'])
In [77]: df2 = DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),
...: index=['xian','shenzhen','wuhan','hangzhou'])
In [78]: df1
Out[78]:
b c d
xian 0 1 2
shenzhen 3 4 5
beijing 6 7 8
In [79]: df2
Out[79]:
b d e
xian 0 1 2
shenzhen 3 4 5
wuhan 6 7 8
hangzhou 9 10 11
In [80]: df1 + df2
Out[80]:
b c d e
beijing NaN NaN NaN NaN
hangzhou NaN NaN NaN NaN
shenzhen 6.0 NaN 9.0 NaN
wuhan NaN NaN NaN NaN
xian 0.0 NaN 3.0 NaN
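The alignment rule can be verified directly: the result carries the union of the labels on each axis, and only cells present in both operands hold a number. A minimal sketch using the same two frames (the name total is illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list("bcd"),
                   index=["xian", "shenzhen", "beijing"])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list("bde"),
                   index=["xian", "shenzhen", "wuhan", "hangzhou"])

total = df1 + df2

# Row and column labels of the result are the union of both inputs.
print(sorted(total.index))
print(sorted(total.columns))
# Count the cells whose row and column exist in both frames.
print(int(total.notna().sum().sum()))
```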
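2.5 Arithmetic methods with fill values
In arithmetic between differently indexed objects, you may want to fill in a special value when an axis label is found in one object but not in the other: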
In [84]: df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))
In [85]: df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))
In [86]: df1
Out[86]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [87]: df2
Out[87]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [88]: df1 + df2
Out[88]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Use the add method of df1, passing df2 and a fill_value argument:
In [89]: df1.add(df2, fill_value=0)
Out[89]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
Common arithmetic methods:
Method | Description |
---|---|
add | Method for addition (+) |
sub | Method for subtraction (-) |
mul | Method for multiplication (*) |
div | Method for division (/) |
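One detail of fill_value is worth a small experiment: it substitutes for an operand that is missing on one side only. When the value is missing on both sides, the result is still missing. The Series below are made up for illustration and are not from the original session.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan], index=["a", "b", "c"])
s2 = pd.Series([10.0, np.nan], index=["b", "c"])

plain = s1 + s2                        # any missing operand gives NaN
filled = s1.add(s2, fill_value=0)      # "a" is missing from s2, so 0 stands in for it
difference = s1.sub(s2, fill_value=0)  # the same option works on sub, mul, and div

print(plain)
print(filled)  # "c" stays NaN because it is missing in both operands
print(difference)
```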
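2.6 Operations between DataFrame and Series
Arithmetic between a DataFrame and a Series relies on broadcasting. Consider first how it works for a NumPy array and one of its rows: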
In [90]: arr = np.arange(12).reshape((3,4))
In [91]: arr
Out[91]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [92]: arr - arr[0]
Out[92]:
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])
In [93]: frame = DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),
...: index=['xian','wuhan','guangzhou','chongqing'])
In [94]: series = frame.iloc[0]
In [95]: frame
Out[95]:
b d e
xian 0 1 2
wuhan 3 4 5
guangzhou 6 7 8
chongqing 9 10 11
In [96]: series
Out[96]:
b 0
d 1
e 2
Name: xian, dtype: int32
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:
In [97]: frame - series
Out[97]:
b d e
xian 0 0 0
wuhan 3 3 3
guangzhou 6 6 6
chongqing 9 9 9
If an index value is found in neither the DataFrame's columns nor the Series's index, the two objects are reindexed to form the union:
In [98]: series2 = Series(range(3), index=['b','e','f'])
In [99]: frame + series2
Out[99]:
b d e f
xian 0.0 NaN 3.0 NaN
wuhan 3.0 NaN 6.0 NaN
guangzhou 6.0 NaN 9.0 NaN
chongqing 9.0 NaN 12.0 NaN
To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods:
In [101]: series3 = frame['d']
In [102]: frame
Out[102]:
b d e
xian 0 1 2
wuhan 3 4 5
guangzhou 6 7 8
chongqing 9 10 11
In [103]: series3
Out[103]:
xian 1
wuhan 4
guangzhou 7
chongqing 10
Name: d, dtype: int32
# The axis number passed is the axis to match on
In [104]: frame.sub(series3, axis=0)
Out[104]:
b d e
xian -1 0 1
wuhan -1 0 1
guangzhou -1 0 1
chongqing -1 0 1
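A common use of matching on axis 0 is subtracting a per-row statistic. The sketch below centers each row on its own mean; this use case is an illustration, not something the original session does, and row_means and centered are made-up names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list("bde"),
                     index=["xian", "wuhan", "guangzhou", "chongqing"])

row_means = frame.mean(axis=1)           # one value per row, indexed by row label
centered = frame.sub(row_means, axis=0)  # match on the row index, broadcast across columns

# Default behaviour for comparison: match on columns, broadcast down the rows.
minus_first_row = frame - frame.iloc[0]

print(centered)
print(minus_first_row)
```

Writing frame - row_means without the method would try to match the row labels against the column labels and produce only NaN.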
2.7 Function application and mapping
P148
2.8 Sorting and ranking
To sort rows or columns lexicographically by label, use the sort_index method, which returns a new, sorted object:
In [105]: obj = Series(range(4),index=['d','a','b','c'])
In [106]: obj
Out[106]:
d 0
a 1
b 2
c 3
dtype: int64
In [107]: obj.sort_index()
Out[107]:
a 1
b 2
c 3
d 0
dtype: int64
For a DataFrame, you can sort by the index on either axis:
In [108]: frame = DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],
...: columns=['d','a','b','c'])
In [109]: frame
Out[109]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [110]: frame.sort_index()
Out[110]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [111]: frame.sort_index(axis=1)
Out[111]:
a b c d
three 1 2 3 0
one 5 6 7 4
Data is sorted in ascending order by default, but it can also be sorted in descending order:
In [112]: frame.sort_index(axis=1,ascending=False)
Out[112]:
d c b a
three 0 3 2 1
one 4 7 6 5
To sort a Series by its values, use its sort_values method:
In [113]: obj = Series([4,7,-3,2])
In [116]: obj
Out[116]:
0 4
1 7
2 -3
3 2
dtype: int64
In [118]: obj.sort_values()
Out[118]:
2 -3
3 2
0 4
1 7
dtype: int64
When sorting, any missing values are placed at the end of the Series by default:
In [119]: obj = Series([4,np.nan,7,np.nan,-3,2])
In [120]: obj.sort_values()
Out[120]:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
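The placement of missing values can be changed with the na_position option of sort_values. A small sketch with the same data; nan_last, nan_first, and descending are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

nan_last = obj.sort_values()                      # default: NaN at the end
nan_first = obj.sort_values(na_position="first")  # NaN at the beginning
descending = obj.sort_values(ascending=False)     # NaN still goes last

print(nan_first)
print(descending)
```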
For a DataFrame, you can sort by the values in one or more columns by passing the column names to the by keyword argument:
In [123]: frame = DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
In [124]: frame
Out[124]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
In [126]: frame.sort_values(by='b')
Out[126]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
To sort by multiple columns, pass a list of names:
In [127]: frame.sort_values(by=['a','b'])
Out[127]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
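When sorting by several columns, ascending also accepts a list so each column gets its own direction. A minimal sketch on the same frame (the name mixed is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by a ascending, then by b descending within each value of a.
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])
print(mixed)
```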
By default, rank breaks ties by assigning each tied group the mean rank:
In [128]: obj = Series([7,-5,7,4,2,0,4])
In [129]: obj.rank()
Out[129]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
# Assign ranks in the order the values appear in the data
In [130]: obj.rank(method='first')
Out[130]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
# Rank in descending order
In [131]: obj.rank(ascending=False,method='max')
Out[131]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
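The tie-breaking options are easiest to compare in one table. The sketch below ranks the same Series with each method that rank accepts and collects the results into a DataFrame named ranks (an illustrative name).

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ["average", "min", "max", "first", "dense"]
# One column per tie-breaking method, one row per original element.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The dense method works like min, except that ranks always increase by 1 between groups, so the two 7s share rank 5 rather than 6.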
A DataFrame can compute ranks over the rows or over the columns:
In [132]: frame = DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
In [133]: frame
Out[133]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
In [134]: frame.rank(axis=1)
Out[134]:
a b
0 1.0 2.0
1 1.0 2.0
2 2.0 1.0
3 1.0 2.0
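2.9 Axis indexes with duplicate values
With an index that contains duplicates, selecting a label that maps to several entries returns a Series, while a label that maps to a single entry returns a scalar: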
In [135]: obj = Series(range(5), index=['a','a','b','b','c'])
In [136]: obj
Out[136]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [137]: obj.index.is_unique
Out[137]: False
In [138]: obj['a']
Out[138]:
a 0
a 1
dtype: int64
In [139]: obj['c']
Out[139]: 4
The same logic applies when indexing a DataFrame:
In [140]: df = DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
In [141]: df
Out[141]:
0 1 2
a -0.882972 1.028678 -0.867953
a 0.453870 -0.057848 0.671163
b -1.035427 -0.186319 1.917317
b -0.305498 -1.377157 -0.385813
In [142]: df.loc['b']
Out[142]:
0 1 2
b -1.035427 -0.186319 1.917317
b -0.305498 -1.377157 -0.385813
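Because the return type depends on whether a label is duplicated, code that expects a scalar can break without warning. The sketch below checks for duplicates and, as one possible remedy, keeps only the first entry per label. That choice is an assumption for illustration, and repeated, single, and first_only are made-up names.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

print(obj.index.is_unique)  # False: some labels occur more than once

repeated = obj.loc["a"]  # label occurs twice, so the result is a Series
single = obj.loc["c"]    # label occurs once, so the result is a scalar

# Keep the first entry for each label; duplicated() marks the later repeats.
first_only = obj[~obj.index.duplicated(keep="first")]
print(first_only)
```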