Study notes on "Python for Data Analysis", ch05 (6)

Chapter 5: Getting Started with pandas

pandas is built on top of NumPy and makes NumPy-centric applications easier to write.

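Introduction to pandas data structures

The two main data structures are Series and DataFrame.

Series

A Series consists of an array of data together with an associated array of data labels, called its index.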

from pandas import Series,DataFrame
import numpy as np
import pandas as pd
obj=Series([4,7,-5,3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
obj.values
array([ 4,  7, -5,  3], dtype=int64)
obj.index
RangeIndex(start=0, stop=4, step=1)

A Series can be created with an index that labels each data point:

obj2=Series([4,7,-5,3],index=['d','b','a','c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')

Index labels can be used to select a single value or a set of values from a Series:

obj2['a']
-5
obj2['d']=6
obj2
d    6
b    7
a   -5
c    3
dtype: int64
obj2[['c','a','d']]
c    3
a   -5
d    6
dtype: int64

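NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and value: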

obj2[obj2>0]
d    6
b    7
c    3
dtype: int64
obj2>0
d     True
b     True
a    False
c     True
dtype: bool
obj2*2
d    12
b    14
a   -10
c     6
dtype: int64
np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can also be thought of as a fixed-length, ordered dict:

'b' in obj2
True
'e' in obj2
False

If the data is already stored in a Python dict, a Series can be created directly from it:

sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
sdata
{'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000}
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

The pandas functions isnull and notnull detect missing data:

pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

One of the most important features of Series is that it automatically aligns differently indexed data in arithmetic operations.

obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3+obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

obj4.name='population'
obj4.index.name='state'
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

The index of a Series can be altered in place by assignment:

obj.index=['Bob','Steve','Jeff','Ryan']
obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
      'year':[2000,2001,2002,2001,2002],
      'pop':[1.5,1.7,3.6,2.4,2.9]}
frame=DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

To arrange the columns in a specified order, pass the columns argument:

DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

If a requested column is not found in the data, it appears filled with NA values:

frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

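The returned Series has the same index as the original DataFrame. Rows can also be retrieved by position or by name, for example with the ix indexing field (ix was deprecated in pandas 0.20 and removed in 1.0 in favor of loc and iloc):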

frame2.ix['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment, either with a scalar or with an array of values:

frame2['debt']=16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt']=np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0

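When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is instead aligned exactly to the DataFrame's index, and any holes are filled with missing values:

val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt']=val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7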

Assigning to a column that does not exist creates a new column. The del keyword deletes columns:

frame2['eastern']=frame2.state=='Ohio'
frame2
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

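The column returned when indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification to that Series is reflected in the source DataFrame.

Another common form of data is a nested dict of dicts:

pop={'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3=DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6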

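The outer dict keys become the columns and the inner keys become the row index. The inner keys are unioned and sorted unless an explicit index is passed:

frame3.T  # transpose
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN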

Dicts of Series are treated in much the same way:

pdata={'Ohio':frame3['Ohio'][:-1],
      'Nevada':frame3['Nevada'][:2]}
DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
frame3.index.name='year'
frame3.columns.name='state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

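Index objects

The Index objects in pandas hold the axis labels and other metadata (such as the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index: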

obj=Series(range(3),index=['a','b','c'])
index=obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified by the user. Immutability matters because it allows Index objects to be shared safely among data structures:

index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64
obj2.index is index
True
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
'Ohio' in frame3.columns
True
2003 in frame3.index
False
2002 in frame3.index
True

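Essential functionality

This section covers the basic mechanics of working with the data in a Series or DataFrame.

Reindexing

An important method on pandas objects is reindex, which creates a new object with the data conformed to a new index.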

obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

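Calling reindex on this Series rearranges the data according to the new index. If an index value is not already present, a missing value is introduced, or the value given by fill_value if one is passed: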

obj2=obj.reindex(['a','b','c','d','e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
obj.reindex(['a','b','c','d','e'],fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

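For ordered data such as time series, it may be desirable to fill in values when reindexing. The method option does this: ffill fills values forward and bfill fills them backward. The example below uses bfill, so each new label takes the next available value, and the last label, which has nothing after it, stays NaN: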

obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='bfill')
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
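
The two fill directions are easy to confuse, so here is a minimal sketch that runs both on the same Series. The variable names colors, forward and backward are illustrative only and do not appear in the notes above.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last known value before it.
forward = colors.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next known value after it.
# Label 5 has no later value, so it remains missing.
backward = colors.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```

Both methods require the original index to be sorted, which is why they suit ordered data such as time series.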

With DataFrame, reindex can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows:

frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2=frame.reindex(['a','b','c','d'])
frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

The columns can be reindexed with the columns keyword:

states=['Texas','Utah','California']
frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Rows and columns can be reindexed in one call, though filling only applies row-wise (along axis 0):

frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

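Reindexing can be written more succinctly by label-indexing with ix (an indexer that was deprecated in pandas 0.20 and removed in 1.0):

frame.ix[['a','b','c','d'],states]
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0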

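Dropping entries from an axis

Dropping one or more entries from an axis is easy given an index array or list of labels. The drop method returns a new object with the indicated values removed from the axis: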

obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
obj.drop(['a','b'])
c    2.0
d    3.0
e    4.0
dtype: float64

With DataFrame, index values can be dropped from either axis:

data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data.drop(['Colorado','Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
data.drop('two',axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data.drop(['two','four'],axis=1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

Indexing, selection, and filtering

obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c    2.0
d    3.0
dtype: float64
obj[[1,3]]
b    1.0
d    3.0
dtype: float64
obj[obj<2]
a    0.0
b    1.0
dtype: float64
obj['b':'c']
b    1.0
c    2.0
dtype: float64
obj['b':'c']=5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data[['three','one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three']>5]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Indexing with a boolean DataFrame, such as one produced by a scalar comparison:

data<5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data<5]=0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

The indexing field ix selects a subset of the rows and columns of a DataFrame using NumPy-like notation plus axis labels:

data.ix['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int32
data.ix[['Colorado','Utah'],[3,0,1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
data.ix[:'Utah','two']  # the colon slices the rows by label, up to and including 'Utah'
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
data.ix[data.three>5,:3]  # :3 slices the first three columns
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
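
Because ix was removed in pandas 1.0, the same selections need loc (label-based) or iloc (position-based) on current versions. The following is a sketch of one possible translation, not the author's code. It starts from a fresh copy of data, so the values differ slightly from the output above, where entries below 5 had already been set to 0.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Labels on both axes: loc.
by_label = data.loc['Colorado', ['two', 'three']]

# A single integer position: iloc.
by_position = data.iloc[2]

# Row labels combined with column positions: chain loc and iloc.
mixed = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]

# Label slices with loc include the end label.
label_slice = data.loc[:'Utah', 'two']

# Boolean row filter, then the first three columns by position.
filtered = data.loc[data['three'] > 5].iloc[:, :3]

print(mixed)
print(filtered)
```

Splitting the mixed case into two steps keeps it explicit which axis is addressed by label and which by position, an ambiguity that ix resolved by guessing.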

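Arithmetic and data alignment

One of the most important pandas features is that it can perform arithmetic between objects with different indexes. Labels that do not appear in both objects produce NaN in the result: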

s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
s2=Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
s1+s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

With DataFrame, alignment is performed on both the rows and the columns:

df1=DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
df1
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
df2
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
df1+df2
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

Arithmetic methods with fill values

In arithmetic between differently indexed objects, you may want to fill in a special value, such as 0, when an axis label is found in one object but not in the other:

df1=DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd'))
df2=DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde'))
df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1+df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

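Using the add method on df1, pass df2 together with a fill_value argument:

df1.add(df2,fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

A fill value can also be specified when reindexing a Series or DataFrame. Here df1 is extended with the columns of df2:

df1.reindex(columns=df2.columns,fill_value=0)
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Operations between DataFrame and Series

Consider the difference between a two-dimensional array and one of its rows: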

arr=np.arange(12.).reshape((3,4))
arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
arr[0]
array([ 0.,  1.,  2.,  3.])
arr-arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is called broadcasting. Operations between a DataFrame and a Series work in much the same way:

frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
series=frame.ix[0]
frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

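By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:

frame-series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0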

If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union:

series2=Series(range(3),index=['b','e','f'])
series2
b    0
e    1
f    2
dtype: int32
frame+series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

To broadcast over the columns instead, matching on the rows, use one of the arithmetic methods and pass axis=0:

series3=frame['d']
frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
frame.sub(series3,axis=0)
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

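Function application and mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects. A function that operates on one-dimensional arrays can be applied to each column or row with the apply method: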

frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
frame
               b         d         e
Utah    0.865812  0.216522  1.041688
Ohio    0.923866 -0.768960 -0.638414
Texas  -1.305017  0.531479 -3.574036
Oregon -1.426827  0.114139  1.024634
np.abs(frame)
               b         d         e
Utah    0.865812  0.216522  1.041688
Ohio    0.923866  0.768960  0.638414
Texas   1.305017  0.531479  3.574036
Oregon  1.426827  0.114139  1.024634
f=lambda x:x.max()-x.min()
frame.apply(f)
b    2.350693
d    1.300438
e    4.615724
dtype: float64
frame.apply(f,axis=1)
Utah      0.825166
Ohio      1.692826
Texas     4.105514
Oregon    2.451461
dtype: float64
def f(x):
    return Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
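            b         d         e
min -1.426827 -0.768960 -3.574036
max  0.923866  0.531479  1.041688

Use applymap to compute a formatted string from each floating-point value in frame. Series has a map method that does the same for a single column:

format=lambda x:'%.2f' % x
frame.applymap(format)
            b      d      e
Utah     0.87   0.22   1.04
Ohio     0.92  -0.77  -0.64
Texas   -1.31   0.53  -3.57
Oregon  -1.43   0.11   1.02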
frame['e'].map(format)
Utah       1.04
Ohio      -0.64
Texas     -3.57
Oregon     1.02
Name: e, dtype: object
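
In recent pandas releases (2.1 and later, as far as I know) applymap is deprecated in favor of an element-wise DataFrame.map. A version-neutral way to get the same result is to apply Series.map to every column. The sketch below assumes a small frame built from a few of the values shown above, and two_decimals is a hypothetical helper name.

```python
import pandas as pd

frame = pd.DataFrame([[0.865812, 0.216522], [-1.305017, 0.531479]],
                     columns=['b', 'd'], index=['Utah', 'Texas'])

def two_decimals(x):
    """Format one number with two decimal places."""
    return '%.2f' % x

# apply hands each column (a Series) to the lambda;
# Series.map then formats that column element by element.
formatted = frame.apply(lambda column: column.map(two_decimals))

print(formatted)
```

Naming the helper two_decimals also avoids shadowing the built-in format function, which the lambda in the notes does.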

Sorting and ranking

Use the sort_index method to sort lexicographically by row or column index:

obj=Series(range(4),index=['d','a','b','c'])
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int32

With a DataFrame, you can sort by the index on either axis:

frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
frame.sort_index()
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=0)
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

Data is sorted in ascending order by default, but it can also be sorted in descending order:

frame.sort_index(axis=1, ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

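To sort a Series by its values, use its order method. As the FutureWarning in the output shows, order is deprecated in favor of sort_values. Missing values are sorted to the end of the Series by default: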

obj=Series([4,7,-3,2])
obj.order()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: order is deprecated, use sort_values(...)
  from ipykernel import kernelapp as app



2   -3
3    2
0    4
1    7
dtype: int64
obj=Series([4,np.nan,7,np.nan,-3,2])
obj.order()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: order is deprecated, use sort_values(...)
  from ipykernel import kernelapp as app




4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
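
Following the hint in the FutureWarning, the same sorts can be written with sort_values. This is a minimal sketch, and the names ascending, descending and nan_first are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Ascending sort; NaN values go to the end by default.
ascending = obj.sort_values()

# Descending sort; NaN values still go to the end.
descending = obj.sort_values(ascending=False)

# na_position='first' moves the NaN values to the front instead.
nan_first = obj.sort_values(na_position='first')

print(ascending)
print(nan_first)
```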

To sort a DataFrame by the values in one or more columns, pass the column names to the by option. The warning in the output recommends sort_values(by=...) over sort_index for this:

frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
frame.sort_index(by='b')
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=…)
  if __name__ == '__main__':
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

To sort by multiple columns, pass a list of names:

frame.sort_index(by=['a','b'])
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=…)
  if __name__ == '__main__':
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
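
For reference, here is the spelling the warning asks for, as a short sketch on the same frame. The result names by_b and by_a_then_b are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort rows by the values in column 'b'.
by_b = frame.sort_values(by='b')

# Sort by 'a' first, then by 'b' within equal values of 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

print(by_b)
print(by_a_then_b)
```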

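Series and DataFrame also have a rank method. By default, rank breaks ties by assigning each tied group the mean rank. Other tie-breaking rules can be chosen with the method option: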

obj=Series([7,-5,7,4,2,0,4])
obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
obj.rank(method='first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
obj.rank(ascending=False,method='max')
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

DataFrame can compute ranks over the rows or the columns:

frame=DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
frame
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
frame.rank(axis=1)
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0

Axis indexes with duplicate values

A Series with duplicate index values:

obj=Series(range(5),index=['a','a','b','b','c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int32

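The is_unique property of the index tells you whether its values are unique:

obj.index.is_unique
False

If a label matches multiple entries, indexing returns a Series; if it matches a single entry, it returns a scalar value:

obj['a']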
a    0
a    1
dtype: int32
obj['c']
4

The same applies when indexing the rows of a DataFrame:

df=DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
df
          0         1         2
a -1.143332  0.282357 -0.751916
a  0.201693 -0.851545  0.915838
b  0.542148 -0.419044 -0.540100
b  1.162466 -0.425086  0.366470
df.ix['b']
          0         1        2
b  0.542148 -0.419044 -0.54010
b  1.162466 -0.425086  0.36647

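Summarizing and computing descriptive statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent NumPy array methods, they are built to handle missing data: NA values are excluded by default.

df=DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
df
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3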

Calling the sum method of a DataFrame returns a Series containing the column sums:

df.sum()
one    9.25
two   -5.80
dtype: float64

Passing axis=1 sums over the rows instead:

df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
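
The exclusion of NA values is controlled by the skipna option of these reductions. A small sketch on the same df, where row_means and strict_means are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default (skipna=True): NaN is ignored, so row 'a' averages only 1.4.
# Row 'c' contains nothing but NaN, so its mean is still NaN.
row_means = df.mean(axis=1)

# skipna=False: any NaN in a row makes that row's result NaN.
strict_means = df.mean(axis=1, skipna=False)

print(row_means)
print(strict_means)
```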

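Some methods, such as idxmin and idxmax, return indirect statistics, namely the index label at which the minimum or maximum value occurs: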

df.idxmax()
one    b
two    d
dtype: object

Other methods are accumulations:

df.cumsum()
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

describe produces multiple summary statistics in one call:

df.describe()
C:\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3823: RuntimeWarning: Invalid value encountered in percentile RuntimeWarning)
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%         NaN       NaN
50%         NaN       NaN
75%         NaN       NaN
max    7.100000 -1.300000

On non-numeric data, describe produces a different set of summary statistics:

obj=Series(['a','a','b','c']*4)
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

Correlation and covariance

import pandas.io.data as web
C:\Anaconda3\lib\site-packages\pandas\io\data.py:35: FutureWarning: 
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)
all_data = {}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.DataReader(ticker,'yahoo','1/1/2000','1/1/2010')
price = DataFrame({tic: data['Adj Close'] 
                       for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume'] 
                    for tic, data in all_data.iteritems()})
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-145-c25e66e2aee1> in <module>()
      1 price = DataFrame({tic: data['Adj Close'] 
----> 2                        for tic, data in all_data.iteritems()})
      3 volume = DataFrame({tic: data['Volume'] 
      4                     for tic, data in all_data.iteritems()})


AttributeError: 'dict' object has no attribute 'iteritems'

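This always raises the error above. The cause is that the code was written for Python 2: in Python 3, dict objects no longer have an iteritems method, so all_data.iteritems() has to be replaced with all_data.items().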
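
Since the download step is fragile, the correlation and covariance methods can still be tried offline. The sketch below is not the book's example: it replaces the Yahoo data with synthetic random-walk prices (an assumption made purely for illustration, so the numbers carry no financial meaning), builds the price table with items() instead of iteritems(), and then computes the statistics on daily percent changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2009-01-01', periods=250, freq='B')
tickers = ['AAPL', 'IBM', 'MSFT', 'GOOG']

# Synthetic stand-in for the downloaded data: one random-walk
# 'Adj Close' column per ticker.
all_data = {}
for ticker in tickers:
    steps = rng.normal(0, 0.01, size=len(dates))
    all_data[ticker] = pd.DataFrame(
        {'Adj Close': 100 * np.exp(np.cumsum(steps))}, index=dates)

# Python 3 spelling of the dict iteration used in the notes.
price = pd.DataFrame({tic: frame['Adj Close']
                      for tic, frame in all_data.items()})

# Daily percent changes; the first row has no predecessor and is dropped.
returns = price.pct_change().dropna()

pair_corr = returns['MSFT'].corr(returns['IBM'])   # one pair
pair_cov = returns['MSFT'].cov(returns['IBM'])
corr_matrix = returns.corr()                       # all pairs
cov_matrix = returns.cov()
corr_with_ibm = returns.corrwith(returns['IBM'])   # every column vs. IBM

print(corr_matrix)
```

Series.corr and Series.cov work on the overlapping, non-NA, index-aligned values of two Series, while the DataFrame versions return the full matrix.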

Unique values, value counts, and membership

obj=Series(['c','a','d','a','a','b','b','c','c'])
uniques=obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)

value_counts computes how often each value occurs in a Series:

obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
pd.value_counts(obj.values,sort=False)
a    3
d    1
b    2
c    3
dtype: int64

isin performs a vectorized set-membership check and can be used to filter a Series or a DataFrame column down to a subset of values:

mask=obj.isin(['b','c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object
data=DataFrame({'Qu1':[1,3,4,3,4],
                'Qu2':[2,3,1,2,3],
               'Qu3':[1,5,2,4,4]})
data
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
result=data.apply(pd.value_counts).fillna(0)
result
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
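
To my knowledge the top-level pd.value_counts function is deprecated in recent pandas releases, while the Series method is not. A sketch of the same histogram using the method form:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# apply passes each column to Series.value_counts. The per-column
# results are aligned on the union of the distinct values, and
# values a column never contains become NaN, replaced here by 0.
result = data.apply(pd.Series.value_counts).fillna(0)

print(result)
```

The row labels of result are the distinct values found across all columns, and each cell counts how often that value occurs in the column.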

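Handling missing data

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is just a sentinel that can be detected easily. The built-in Python None value is also treated as NA, as the last step of the example shows: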

string_data=Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
string_data[0]=None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool

Filtering out missing data

dropna filters out missing data. On a Series, it returns a Series containing only the non-null data and their index values:

from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, dropna by default drops any row that contains a missing value. Passing how='all' drops only rows that are all NA, axis=1 applies the same logic to columns, and thresh keeps only rows with at least that many non-NA values:

data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
cleaned=data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data[4]=NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1,how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
df=DataFrame(np.random.randn(7,3))
df.ix[:4,1]=NA
df.ix[:2,2]=NA
df
          0         1         2
0 -0.625413       NaN       NaN
1 -0.149081       NaN       NaN
2  0.508126       NaN       NaN
3  1.302235       NaN  0.779230
4  0.314148       NaN -0.354226
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001
df.dropna(thresh=3)
          0         1         2
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001

Filling in missing data

Calling fillna with a constant replaces missing values with that constant:

df.fillna(0)
          0         1         2
0 -0.625413  0.000000  0.000000
1 -0.149081  0.000000  0.000000
2  0.508126  0.000000  0.000000
3  1.302235  0.000000  0.779230
4  0.314148  0.000000 -0.354226
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001

Calling fillna with a dict fills a different value for each column. Keys that match no column, such as 3 here, are ignored, so column 2 keeps its NaN values:

df.fillna({1:0.5,3:-1})
          0         1         2
0 -0.625413  0.500000       NaN
1 -0.149081  0.500000       NaN
2  0.508126  0.500000       NaN
3  1.302235  0.500000  0.779230
4  0.314148  0.500000 -0.354226
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001
df
          0         1         2
0 -0.625413       NaN       NaN
1 -0.149081       NaN       NaN
2  0.508126       NaN       NaN
3  1.302235       NaN  0.779230
4  0.314148       NaN -0.354226
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001
# fillna returns a new object by default; inplace=True modifies df itself instead
_=df.fillna(0,inplace=True)
df
          0         1         2
0 -0.625413  0.000000  0.000000
1 -0.149081  0.000000  0.000000
2  0.508126  0.000000  0.000000
3  1.302235  0.000000  0.779230
4  0.314148  0.000000 -0.354226
5 -2.301367 -0.016120 -0.350993
6 -0.810814 -1.352287  1.630001

The fill methods available for reindexing can also be used with fillna. The limit option caps the number of consecutive values that are filled:

df=DataFrame(np.random.randn(6,3))
df.ix[2:,1]=NA
df.ix[4:,2]=NA
df
          0         1         2
0  0.795837  1.038840  1.805921
1 -0.990684 -0.542918  1.262955
2  0.215151       NaN  1.680489
3 -1.881573       NaN -0.301868
4  0.491152       NaN       NaN
5  0.258699       NaN       NaN
df.fillna(method='ffill')
          0         1         2
0  0.795837  1.038840  1.805921
1 -0.990684 -0.542918  1.262955
2  0.215151 -0.542918  1.680489
3 -1.881573 -0.542918 -0.301868
4  0.491152 -0.542918 -0.301868
5  0.258699 -0.542918 -0.301868
df.fillna(method='ffill',limit=2)
          0         1         2
0  0.795837  1.038840  1.805921
1 -0.990684 -0.542918  1.262955
2  0.215151 -0.542918  1.680489
3 -1.881573 -0.542918 -0.301868
4  0.491152       NaN -0.301868
5  0.258699       NaN -0.301868
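
As far as I know, recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. A minimal sketch of the equivalent calls, using a small hand-made frame instead of random data so the result is predictable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan, np.nan]})

# Forward fill: carry the last valid value down each column.
filled = df.ffill()

# limit=2 fills at most two consecutive missing values per gap,
# so the third NaN in column 'a' is left untouched.
limited = df.ffill(limit=2)

print(filled)
print(limited)
```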

fillna can also be passed a statistic of the Series itself, such as its mean or median:

data=Series([1.,NA,3.5,NA,7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
data.mean()
3.8333333333333335
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

Hierarchical indexing

Hierarchical indexing lets you have multiple (two or more) index levels on a single axis.

data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
data
a  1    0.091014
   2    1.542964
   3   -0.287869
b  1    1.551622
   2   -2.931760
   3    0.751749
c  1    1.660620
   2   -1.493720
d  2    0.718965
   3    0.826192
dtype: float64

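The output above is the formatted view of a Series that has a MultiIndex as its index. Selecting by an outer-level label, a label slice, or a list of labels returns the matching subset: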

data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
data['b']
1    1.551622
2   -2.931760
3    0.751749
dtype: float64
data['b':'c']
b  1    1.551622
   2   -2.931760
   3    0.751749
c  1    1.660620
   2   -1.493720
dtype: float64
data.ix[['b','d']]
b  1    1.551622
   2   -2.931760
   3    0.751749
d  2    0.718965
   3    0.826192
dtype: float64

Selection is also possible on the "inner" level:

data[:,2]
a    1.542964
b   -2.931760
c   -1.493720
d    0.718965
dtype: float64

This data can be rearranged into a DataFrame with its unstack method:

data.unstack()
          1         2         3
a  0.091014  1.542964 -0.287869
b  1.551622 -2.931760  0.751749
c  1.660620 -1.493720       NaN
d       NaN  0.718965  0.826192

The inverse operation of unstack is stack:

data.unstack().stack()
a  1    0.091014
   2    1.542964
   3   -0.287869
b  1    1.551622
   2   -2.931760
   3    0.751749
c  1    1.660620
   2   -1.493720
d  2    0.718965
   3    0.826192
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

frame=DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame['Ohio']
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

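Reordering and sorting levels

swaplevel takes two level numbers or names and returns a new object with the levels interchanged (the data itself is unchanged):

frame.swaplevel('key1','key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

sortlevel sorts the data (stably) using only the values in a single level. When swapping levels, it is common to also call sortlevel so that the result is sorted:

frame.sortlevel(1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.swaplevel(0,1).sortlevel(0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11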
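
sortlevel was later deprecated and, to my knowledge, removed in pandas 1.0. Its replacement is sort_index with a level argument. A sketch that should reproduce the two results above on a current pandas, with by_key2 and swapped as illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Counterpart of frame.sortlevel(1): sort the rows by the 'key2' level.
by_key2 = frame.sort_index(level='key2')

# Counterpart of frame.swaplevel(0,1).sortlevel(0).
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

print(by_key2)
print(swapped)
```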

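Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series have a level option, which specifies the level to aggregate by on a particular axis:

frame.sum(level='key2')
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
frame.sum(level='color',axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10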
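
The level option of sum was removed in pandas 2.0; the same aggregation is expressed with groupby(level=...). The sketch below assumes the frame defined above. For the column-wise case it transposes first, because grouping along axis=1 is itself deprecated in recent releases.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Counterpart of frame.sum(level='key2'): sum rows sharing a key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Counterpart of frame.sum(level='color', axis=1): transpose so the
# column levels become row levels, group, then transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```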

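Using a DataFrame's columns

It is common to want to use one or more columns of a DataFrame as the row index, or to move the row index into the DataFrame's columns:

frame=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3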

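DataFrame's set_index function creates a new DataFrame using one or more of its columns as the index. By default those columns are removed from the DataFrame; passing drop=False keeps them:

frame2=frame.set_index(['c','d'])
frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3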

reset_index does the opposite of set_index: the hierarchical index levels are moved into the columns:

frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

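Other pandas topics

Integer indexing

When you need reliable position-based indexing regardless of the index type, you can use the iget_value method of Series and the irow and icol methods of DataFrame. As the warnings in the output show, these are deprecated in favor of iloc and iat: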

ser3=Series(range(3),index=[-5,1,3])
ser3.iget_value(2)
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
  from ipykernel import kernelapp as app


2
frame=DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])
frame
   0  1
2  0  1
0  2  3
1  4  5
frame.irow(0)
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
  if __name__ == '__main__':





0    0
1    1
Name: 2, dtype: int32
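
The replacements named in the warnings look as follows. This is a short sketch on the same objects; the result names are illustrative.

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(range(3), index=[-5, 1, 3])
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

# Position-based access, whatever the index labels are.
third_value = ser3.iloc[2]       # replaces ser3.iget_value(2)
first_row = frame.iloc[0]        # replaces frame.irow(0)
second_col = frame.iloc[:, 1]    # replaces frame.icol(1)
one_cell = frame.iat[1, 0]       # scalar at row position 1, column position 0

# Label-based access for contrast: the row whose label is 0
# sits at position 1.
row_labelled_0 = frame.loc[0]

print(first_row)
print(row_labelled_0)
```

With an integer index like [2, 0, 1], keeping iloc for positions and loc for labels avoids any ambiguity about which one an integer refers to.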

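Panel data

pandas has a Panel data structure, which can be thought of as a three-dimensional analogue of DataFrame. A Panel can be created from a dict of DataFrame objects or from a three-dimensional ndarray: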

import pandas.io.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 14 (major_axis) x 0 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: </html> to <script language=javascript type="text/javascript">
Minor_axis axis: None

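This never works. The axis labels in the output are HTML tags, which suggests that the Yahoo request returned a web page instead of price data.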
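
Panel itself was deprecated in pandas 0.20 and removed in 0.25, so on current versions the usual substitute is a DataFrame with a hierarchical index. The sketch below is one possible arrangement, not the book's code. It uses small random frames as a stand-in for the downloaded prices, and the column names Open and Close are chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2009-01-01', periods=3, freq='D')

# Stand-in for the per-ticker DataFrames that would have filled the Panel.
frames = {stk: pd.DataFrame(rng.random((3, 2)), index=dates,
                            columns=['Open', 'Close'])
          for stk in ['AAPL', 'GOOG']}

# Concatenating a dict of DataFrames uses the keys as an outer index
# level, giving a "stacked" two-level table.
stacked = pd.concat(frames, names=['ticker', 'date'])

# One ticker (what selecting an item of the Panel would give).
aapl = stacked.loc['AAPL']

# One date across all tickers (a cross-section along the inner level).
one_day = stacked.xs(dates[0], level=1)

print(stacked)
```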
