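The content of this article comes from the book "Python for Data Analysis" (Chinese edition: 《利用Python进行数据分析》). It is collected here only for convenient everyday reference; please report any errors you find.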
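Reindexing
reindex creates a new object conformed to a new index. If an index value is not present in the original object, a missing value is introduced for it.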
>>> from pandas import DataFrame,Series
>>> obj = Series([4.5,9.3,-8.4,6.6],index = ['d','b','a','c'])
>>> obj
d 4.5
b 9.3
a -8.4
c 6.6
dtype: float64
>>> obj2 = obj.reindex(['a','b','c','d','e'])
>>> obj2
a -8.4
b 9.3
c 6.6
d 4.5
e NaN
dtype: float64
>>> obj2 = obj.reindex(['a','b','c','d','e'],fill_value=0)
>>> obj2
a -8.4
b 9.3
c 6.6
d 4.5
e 0.0
dtype: float64
Use the method option to forward-fill values
>>> obj3 = Series(['blue','purple','yellow'],index = [0,2,4])
>>> obj3.reindex(range(6),method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
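The three reindex variants above can be put side by side in a script. This is a minimal sketch, assuming a current pandas release imported as pd; the names with_nan, with_zero, colors and filled are illustrative and not from the original.

```python
import pandas as pd

# Same data as the transcript above.
obj = pd.Series([4.5, 9.3, -8.4, 6.6], index=["d", "b", "a", "c"])

# reindex returns a new object; labels missing from obj become NaN.
with_nan = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value replaces the NaN that would otherwise be introduced.
with_zero = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# method="ffill" carries the last valid value forward into new labels.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")

print(with_nan)
print(with_zero)
print(filled)
```

In every case the original object keeps its own index, because reindex builds a new object rather than modifying in place.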
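Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list. Because this can require a bit of data munging and set logic, the drop method returns a new object with the indicated values deleted from the specified axis: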
>>> from pandas import DataFrame,Series
>>> import numpy as np
>>> obj = Series(np.arange(5.),index=['a','b','c','d','e'])
>>> new_obj = obj.drop('c')
>>> new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
>>> obj.drop(['d','c'])
a 0.0
b 1.0
e 4.0
dtype: float64
>>> data = DataFrame(np.arange(16).reshape((4,4)),index=['A','B','C','D'],columns=['one','two','three','four'])
>>> data
one two three four
A 0 1 2 3
B 4 5 6 7
C 8 9 10 11
D 12 13 14 15
>>> data.drop(['B','D'])
one two three four
A 0 1 2 3
C 8 9 10 11
>>> data.drop(['one','three'],axis=1)
two four
A 1 3
B 5 7
C 9 11
D 13 15
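As a small sketch of the same idea with keyword arguments (assuming a pandas release that accepts the index= and columns= keywords of drop; the variable names are illustrative), the following also checks that the original DataFrame is left untouched:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["A", "B", "C", "D"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label; equivalent to data.drop(["B", "D"]).
rows_dropped = data.drop(index=["B", "D"])

# Drop columns by label; equivalent to data.drop(["one", "three"], axis=1).
cols_dropped = data.drop(columns=["one", "three"])

print(rows_dropped)
print(cols_dropped)
print(data.shape)  # the original still has all four rows and columns
```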
Indexing, selection, and filtering
>>> obj = Series(np.arange(4.),index=['a','b','c','d'])
>>> obj['b']
1.0
>>> obj[1]
1.0
>>> obj[2:4]
c 2.0
d 3.0
dtype: float64
>>> obj[['b','c']]
b 1.0
c 2.0
dtype: float64
>>> obj[obj>2]
d 3.0
dtype: float64
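The transcript above slices by position only. A minimal sketch of slicing the same Series by label follows (variable names are illustrative). It uses iloc for the positional case, since recent pandas releases warn when an integer key on a non-integer index falls back to positions.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label slice: both endpoints, "b" and "c", are part of the result.
by_label = obj["b":"c"]

# Positional slice: position 2 is excluded, as in ordinary Python slicing.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```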
Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive. Indexing into a DataFrame, by contrast, retrieves one or more columns:
>>> data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
>>> data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
>>> data[['two','three']]
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14
This style of indexing has a few special cases. First, rows can be selected by slicing or with a boolean array:
>>> data[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
>>> data[data['three']>5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
You can also index with a boolean DataFrame
>>> data < 5
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
>>> data[data < 5] = 0
>>> data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
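For label-based indexing on a DataFrame, the special indexing field ix is introduced here. It selects a subset of rows and columns using NumPy-like notation plus axis labels. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the transcript below runs only on older releases; loc (label-based) and iloc (position-based) are the current replacements.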
>>> import pandas as pd
>>> data = DataFrame(np.arange(16).reshape((4,4)),index=['A','B','C','D'],columns=['one','two','three','four'])
>>> data.ix['A',['one','three']]
one 0
three 2
Name: A, dtype: int32
>>> data.ix[['A','C'],[3,0,1]]
four one two
A 3 0 1
C 11 8 9
>>> data.ix[2]
one 8
two 9
three 10
four 11
Name: C, dtype: int32
>>> data.ix[:'C','four']
A 3
B 7
C 11
Name: four, dtype: int32
>>> data.ix[data.three > 5,:3]
one two three
B 4 5 6
C 8 9 10
D 12 13 14
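For readers on a current pandas release, the ix selections above can be rewritten with loc and iloc. This is a sketch of one possible translation, not the book's own code; where ix mixed row labels with column positions, the positions are first converted to labels through data.columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["A", "B", "C", "D"],
    columns=["one", "two", "three", "four"],
)

# data.ix['A', ['one', 'three']]: labels on both axes.
a = data.loc["A", ["one", "three"]]

# data.ix[['A', 'C'], [3, 0, 1]]: row labels with column positions.
b = data.loc[["A", "C"], data.columns[[3, 0, 1]]]

# data.ix[2]: third row by position.
c = data.iloc[2]

# data.ix[:'C', 'four']: label slice, endpoint included.
d = data.loc[:"C", "four"]

# data.ix[data.three > 5, :3]: boolean row mask, first three columns.
e = data.loc[data["three"] > 5, data.columns[:3]]

for result in (a, b, c, d, e):
    print(result)
```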
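One of the most important features of pandas is arithmetic between objects with different indexes. When objects are added and their indexes differ, the index of the result is the union of the two indexes.
Automatic data alignment introduces NA values at the labels that do not overlap, and missing values then propagate through arithmetic operations.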
>>> s1 = Series([2.1,3.1,4.1,5.1,6.1],index=['a','b','c','d','e'])
>>> s2 = Series([-2.3,4.5,5.6,7.8],index=['a','c','d','e'])
>>> s1,s2
(a 2.1
b 3.1
c 4.1
d 5.1
e 6.1
dtype: float64, a -2.3
c 4.5
d 5.6
e 7.8
dtype: float64)
>>> s1+s2
a -0.2
b NaN
c 8.6
d 10.7
e 13.9
dtype: float64
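The alignment rule can be checked directly. The sketch below (illustrative variable names) compares the index of the sum with the union of the two input indexes and lists the labels that ended up as NaN.

```python
import pandas as pd

s1 = pd.Series([2.1, 3.1, 4.1, 5.1, 6.1], index=["a", "b", "c", "d", "e"])
s2 = pd.Series([-2.3, 4.5, 5.6, 7.8], index=["a", "c", "d", "e"])

total = s1 + s2

# The result is indexed by the union of both indexes.
union = s1.index.union(s2.index)

# Labels present in only one operand come out as NaN.
missing = total[total.isna()].index.tolist()

print(total)
print(list(union))
print(missing)
```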
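For a DataFrame, alignment happens on both the rows and the columns: addition returns a new DataFrame whose index and columns are the unions of those of the two original DataFrames.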
>>> def1 = DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
>>> def2 = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> def1+def2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
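A short sketch of the same two frames (renamed df_a and df_b here purely for illustration) makes the rule concrete: only cells whose row label and column label both exist in each operand receive a value.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(9.0).reshape((3, 3)), columns=list("bcd"),
                    index=["Ohio", "Texas", "Colorado"])
df_b = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list("bde"),
                    index=["Utah", "Ohio", "Texas", "Oregon"])

result = df_a + df_b

# Labels shared by both operands on each axis.
shared_rows = df_a.index.intersection(df_b.index)
shared_cols = df_a.columns.intersection(df_b.columns)

print(result)
print(list(shared_rows), list(shared_cols))
print(int(result.notna().sum().sum()))  # number of non-NaN cells
```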
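Arithmetic methods with fill values
In arithmetic between differently indexed objects, positions that do not overlap produce NaN. The add() method accepts a fill_value argument that substitutes a value for the missing side, and fill_value can likewise be passed when reindexing: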
>>> df1 = DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd'))
>>> df2 = DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde'))
>>> df1
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
>>> df2
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
>>> df1+df2
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
>>> df1.reindex(columns=df2.columns,fill_value=0)
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
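The transcript shows the plain + operator and reindex, but not the add call itself. A minimal sketch of it, using the same df1 and df2 (the names plain and filled are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Plain addition: cells missing from either operand become NaN.
plain = df1 + df2

# add with fill_value: a cell missing from one operand is treated as 0,
# so the value from the other operand passes through unchanged.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Here df2 covers every cell of the result, so filled contains no NaN at all; a cell missing from both operands would still be NaN.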
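Operations between DataFrame and Series
Arithmetic between a DataFrame and a Series is well defined. The example below computes the difference between a two-dimensional array and one of its rows.
This is called broadcasting, and operations between a DataFrame and a Series work the same way.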
>>> arr = np.arange(12.).reshape(3,4)
>>> arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
>>> arr[2]
array([ 8., 9., 10., 11.])
>>> arr-arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows.
>>> from pandas import DataFrame,Series
>>> import numpy as np
>>> frame = DataFrame(np.arange(12.).reshape((4,3)),columns = list('bde'),index=['Utah','Ohio','Texas','OreGon'])
>>> frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
OreGon 9.0 10.0 11.0
>>> series = frame.iloc[0]
>>> series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
>>> frame-series
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
OreGon 9.0 9.0 9.0
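To connect this with the NumPy example, the sketch below subtracts the first row from every row in both libraries and compares the results. It assumes a pandas release that provides DataFrame.to_numpy; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# First row as a Series; its index is the frame's column labels.
first_row = frame.iloc[0]

# The Series index is matched to the columns and broadcast down the rows.
diff = frame - first_row

# The same computation on the underlying NumPy array.
arr = frame.to_numpy()
arr_diff = arr - arr[0]

print(diff)
print(arr_diff)
```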
If an index value is not found in either the columns of the DataFrame or the index of the Series, both objects are reindexed to form the union.
>>> series2 = Series(range(3),index=['b','e','f'])
>>> frame+series2
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
OreGon 9.0 NaN 12.0 NaN
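If you want to match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods. The axis number you pass is the axis to match on; in this example the goal is to match on the row index of the DataFrame and broadcast across the columns.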
>>> series3 = frame['d']
>>> frame
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
OreGon 9.0 10.0 11.0
>>> series3
Utah 1.0
Ohio 4.0
Texas 7.0
OreGon 10.0
Name: d, dtype: float64
>>> frame.sub(series3,axis=0)
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
OreGon -1.0 0.0 1.0
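The difference between matching on rows and the default behaviour can be seen side by side. In this sketch (illustrative names), the plain operator tries to match the row labels of the Series against the column labels of the frame, finds no overlap, and yields only NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
col_d = frame["d"]  # indexed by the row labels

# Match on the row index (axis=0) and broadcast across the columns.
by_rows = frame.sub(col_d, axis=0)

# Default operator: Series labels are matched against column labels,
# so nothing lines up and the result is all NaN.
by_cols = frame - col_d

print(by_rows)
print(by_cols)
```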
Function application and mapping
NumPy's element-wise array functions (ufuncs) also work on pandas objects
>>> frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> frame
b d e
Utah 0.225392 0.944038 -0.286161
Ohio -0.075078 -1.416288 -1.681523
Texas 1.674864 2.292591 0.433947
Oregon 0.525176 1.926218 -0.891167
>>> np.abs(frame)
b d e
Utah 0.225392 0.944038 0.286161
Ohio 0.075078 1.416288 1.681523
Texas 1.674864 2.292591 0.433947
Oregon 0.525176 1.926218 0.891167
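Another common operation is applying a function to the one-dimensional arrays formed by each column or row. The apply method of DataFrame does exactly this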
>>> f = lambda x: x.max()-x.min()
>>> frame.apply(f)
b 1.749942
d 3.708879
e 2.115470
dtype: float64
>>> frame.apply(f,axis=1)
Utah 1.230200
Ohio 1.606445
Texas 1.858644
Oregon 2.817385
dtype: float64
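Because the transcript uses random data, its numbers cannot be reproduced exactly. The sketch below repeats the idea on a deterministic frame and also shows that the applied function may return a Series, in which case apply assembles a DataFrame. The helper names value_range and min_max are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def value_range(x):
    """Spread between the largest and smallest value of a 1-D slice."""
    return x.max() - x.min()

def min_max(x):
    """Return several statistics at once as a labelled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

per_column = frame.apply(value_range)        # one result per column
per_row = frame.apply(value_range, axis=1)   # one result per row
summary = frame.apply(min_max)               # rows "min"/"max", one column each

print(per_column)
print(per_row)
print(summary)
```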
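Sorting and ranking
Sorting a dataset by some criterion is an important built-in operation. The sort_index method sorts by index labels and returns a new, sorted object.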
>>> from pandas import DataFrame,Series
>>> import numpy as np
>>> obj = Series(range(4),index = ['d','a','b','c'])
>>> obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
For a DataFrame, you can sort by the index on either axis.
Data is sorted in ascending order by default, but it can also be sorted in descending order:
>>> frame = DataFrame(np.arange(8).reshape(2,4),index=['three','one'],columns=['d','a','b','c'])
>>> frame.sort_index()
d a b c
one 4 5 6 7
three 0 1 2 3
>>> frame.sort_index(axis=1)
a b c d
three 1 2 3 0
one 5 6 7 4
>>> frame.sort_index(axis=0)
d a b c
one 4 5 6 7
three 0 1 2 3
>>> frame.sort_index(axis=1,ascending=False)
d c b a
three 0 3 2 1
one 4 7 6 5
Sorting a Series by its values uses sort_values (older pandas releases named this method order):
>>> obj = Series([4,7,-3,2])
>>> obj.sort_values()
2 -3
3 2
0 4
1 7
dtype: int64
>>> obj.sort_index()
0 4
1 7
2 -3
3 2
dtype: int64
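As a brief sketch of sorting a Series by value in both directions (variable names are illustrative), note that sort_values returns a new object and keeps each value attached to its original label:

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2])

ascending = obj.sort_values()                  # smallest value first
descending = obj.sort_values(ascending=False)  # largest value first

print(ascending)
print(descending)
print(obj)  # the original order is unchanged
```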
To sort by the values in one or more columns, pass the column name or a list of names to by
>>> frame = DataFrame({'b':[4,7,3,8],'a':[0,1,0,1]})
>>> frame
a b
0 0 4
1 1 7
2 0 3
3 1 8
>>> frame.sort_values(by='b')
a b
2 0 3
0 0 4
1 1 7
3 1 8
>>> frame.sort_values(by=['a','b'])
a b
2 0 3
0 0 4
1 1 7
3 1 8
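When sorting by several columns, the direction can differ per column. The following sketch assumes that ascending accepts a list with one flag per entry in by, which is the case in current pandas releases; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, 3, 8], "a": [0, 1, 0, 1]})

# Sort by "a" first, then by "b" within each group of equal "a".
by_a_then_b = frame.sort_values(by=["a", "b"])

# Same keys, but "b" in descending order within each "a" group.
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])

print(by_a_then_b)
print(mixed)
```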
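Ranking is closely related to sorting: it assigns each value a rank and, by default, breaks ties by giving every member of a tied group the mean rank of that group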
>>> obj = Series([3,6,9,-2,-4,7,3,7])
>>> obj.rank()
0 3.5
1 5.0
2 8.0
3 2.0
4 1.0
5 6.5
6 3.5
7 6.5
dtype: float64
>>> obj.rank(method='first')
0 3.0
1 5.0
2 8.0
3 2.0
4 1.0
5 6.0
6 4.0
7 7.0
dtype: float64
>>> obj.rank(ascending=False,method='max')
0 6.0
1 4.0
2 1.0
3 7.0
4 8.0
5 3.0
6 6.0
7 3.0
dtype: float64
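To compare the tie-breaking rules at a glance, the sketch below ranks the same Series with several values of the method argument and collects the results in one table. The names methods and table are illustrative; "min" is not used in the transcript above but is a documented option of rank.

```python
import pandas as pd

obj = pd.Series([3, 6, 9, -2, -4, 7, 3, 7])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "first"]
table = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(table)
```

The tied values (the two 3s and the two 7s) are the only rows where the columns differ.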
A DataFrame can compute ranks over the rows or over the columns
>>> frame = DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
>>> frame
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
>>> frame.rank(axis=1)
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0
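Axis indexes with duplicate values
A Series whose index contains duplicate labels, and the is_unique property that reports whether the labels are unique. Selecting a duplicated label returns a Series, while a label that occurs once returns a scalar: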
>>> obj = Series(range(5),index=['a','a','b','b','c'])
>>> obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
>>> obj.index.is_unique
False
>>> obj['a']
a 0
a 1
dtype: int64
>>> obj['c']
4
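Since the return type depends on whether a label is duplicated, code that indexes such a Series may need to handle both cases. A minimal sketch (illustrative names):

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

dup = obj["a"]     # label occurs twice: a Series with two entries
single = obj["c"]  # label occurs once: a scalar

print(obj.index.is_unique)
print(type(dup).__name__, len(dup))
print(single)
```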
The same holds when indexing the rows of a DataFrame:
>>> df = DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
>>> df
0 1 2
a -1.619126 0.134523 0.906778
a 0.748143 0.528331 0.470493
b -0.480982 0.876438 -0.772287
b -0.223553 0.002319 -0.850182
>>> df.loc['b']
0 1 2
b -0.480982 0.876438 -0.772287
b -0.223553 0.002319 -0.850182
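A deterministic sketch of the same row selection, assuming a current pandas release where loc is the label-based indexer; it also uses Index.duplicated to flag repeated labels. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape((4, 3)), index=["a", "a", "b", "b"])

# Selecting a duplicated row label returns every matching row.
rows_b = df.loc["b"]

# True for each label that repeats an earlier one.
repeated = df.index.duplicated()

print(rows_b)
print(repeated)
```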