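Chapter 5: Getting Started with pandas
pandas is built on top of NumPy and makes NumPy-centric applications easier to write.
Introduction to pandas data structures
The two main data structures are Series and DataFrame.
Series
A Series consists of an array of data together with an associated array of data labels, called its index.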
from pandas import Series,DataFrame
import numpy as np
import pandas as pd
obj=Series([4,7,-5,3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
obj.values
array([ 4, 7, -5, 3], dtype=int64)
obj.index
RangeIndex(start=0, stop=4, step=1)
A Series can be created with an index that labels each data point:
obj2=Series([4,7,-5,3],index=['d','b','a','c'])
obj2
d 4
b 7
a -5
c 3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
Use index labels to select a single value or a set of values from a Series:
obj2['a']
-5
obj2['d']=6
obj2
d 6
b 7
a -5
c 3
dtype: int64
obj2[['c','a','d']]
c 3
a -5
d 6
dtype: int64
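NumPy array operations, such as boolean filtering, scalar multiplication, and math functions, preserve the link between index and value: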
obj2[obj2>0]
d 6
b 7
c 3
dtype: int64
obj2>0
d True
b True
a False
c True
dtype: bool
obj2*2
d 12
b 14
a -10
c 6
dtype: int64
np.exp(obj2)
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
A Series can also be thought of as a fixed-length, ordered dict:
'b' in obj2
True
'e' in obj2
False
If the data is stored in a Python dict, a Series can be created directly from that dict:
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
sdata
{'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000}
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
The pandas isnull and notnull functions detect missing data:
pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
pd.notnull(obj4)
California False
Ohio True
Oregon True
Texas True
dtype: bool
obj4.isnull()
California True
Ohio False
Oregon False
Texas False
dtype: bool
One of the most important features of Series is that it automatically aligns differently indexed data in arithmetic operations.
obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
obj3+obj4
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series object itself and its index have a name attribute, which is closely tied to other key areas of pandas functionality:
obj4.name='population'
obj4.index.name='state'
obj4
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
The index of a Series can be modified in place by assignment:
obj.index=['Bob','Steve','Jeff','Ryan']
obj
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
frame=DataFrame(data)
frame
pop | state | year | |
---|---|---|---|
0 | 1.5 | Ohio | 2000 |
1 | 1.7 | Ohio | 2001 |
2 | 3.6 | Ohio | 2002 |
3 | 2.4 | Nevada | 2001 |
4 | 2.9 | Nevada | 2002 |
Pass columns to arrange the columns in a specified order:
DataFrame(data,columns=['year','state','pop'])
year | state | pop | |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
If a requested column is not found in the data, it is filled with NA values:
frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | NaN |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
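The returned Series has the same index as the original DataFrame. Rows can also be retrieved by position or by name, for example with the ix indexing field (later pandas releases deprecate ix in favor of loc and iloc):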
frame2.ix['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
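The row lookup above can also be written with the label-based loc and position-based iloc indexers. The following is a minimal sketch, assuming a pandas version that provides both; the small frame2 here is a shortened stand-in for the one in the notes.

```python
import pandas as pd

# Shortened stand-in for frame2 (three rows instead of five).
frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002],
     "state": ["Ohio", "Ohio", "Nevada"],
     "pop": [1.5, 1.7, 2.4]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]   # select a row by its index label
by_position = frame2.iloc[2]     # select the same row by integer position
print(by_label)
```

Keeping the two indexers separate avoids the ambiguity ix has when the index itself contains integers.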
Columns can be modified by assignment.
frame2['debt']=16.5
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 16.5 |
two | 2001 | Ohio | 1.7 | 16.5 |
three | 2002 | Ohio | 3.6 | 16.5 |
four | 2001 | Nevada | 2.4 | 16.5 |
five | 2002 | Nevada | 2.9 | 16.5 |
frame2['debt']=np.arange(5.)
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 0.0 |
two | 2001 | Ohio | 1.7 | 1.0 |
three | 2002 | Ohio | 3.6 | 2.0 |
four | 2001 | Nevada | 2.4 | 3.0 |
five | 2002 | Nevada | 2.9 | 4.0 |
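When a list or array is assigned to a column, its length must match the length of the DataFrame. If a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values: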
val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt']=val
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | -1.2 |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | -1.5 |
five | 2002 | Nevada | 2.9 | -1.7 |
Assigning to a column that does not exist creates a new column; the del keyword deletes a column:
frame2['eastern']=frame2.state=='Ohio'
frame2
year | state | pop | debt | eastern | |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | True |
two | 2001 | Ohio | 1.7 | -1.2 | True |
three | 2002 | Ohio | 3.6 | NaN | True |
four | 2001 | Nevada | 2.4 | -1.5 | False |
five | 2002 | Nevada | 2.9 | -1.7 | False |
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
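In the pandas version used here, the column returned when indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification to that Series is reflected in the source DataFrame.
Another common form of data is a nested dict of dicts: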
pop={'Nevada':{2001:2.4,2002:2.9},
'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
frame3=DataFrame(pop)
frame3
Nevada | Ohio | |
---|---|---|
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
The outer dict keys become the columns and the inner keys become the row index.
frame3.T # transpose
2000 | 2001 | 2002 | |
---|---|---|---|
Nevada | NaN | 2.4 | 2.9 |
Ohio | 1.5 | 1.7 | 3.6 |
DataFrame(pop,index=[2001,2002,2003])
Nevada | Ohio | |
---|---|---|
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
2003 | NaN | NaN |
A dict of Series is treated in much the same way:
pdata={'Ohio':frame3['Ohio'][:-1],
'Nevada':frame3['Nevada'][:2]}
DataFrame(pdata)
Nevada | Ohio | |
---|---|---|
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
frame3.index.name='year'
frame3.columns.name='state'
frame3
state | Nevada | Ohio |
---|---|---|
year | ||
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
frame3.values
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7]], dtype=object)
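Index objects
pandas Index objects hold the axis labels and other metadata (such as axis names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index: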
obj=Series(range(3),index=['a','b','c'])
index=obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
Index objects are immutable, so users cannot modify them. Immutability matters because it lets Index objects be shared safely among data structures:
index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is index
True
frame3
state | Nevada | Ohio |
---|---|---|
year | ||
2000 | NaN | 1.5 |
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
'Ohio' in frame3.columns
True
2003 in frame3.index
False
2002 in frame3.index
True
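Essential functionality
This section covers the basic ways of working with the data in a Series or DataFrame.
Reindexing
An important method on pandas objects is reindex, which creates a new object conformed to a new index.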
obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
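Calling reindex on this Series rearranges the data according to the new index, introducing missing values for any index labels that were not already present: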
obj2=obj.reindex(['a','b','c','d','e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
obj.reindex(['a','b','c','d','e'],fill_value=0)
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64
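For ordered data such as time series, it may be desirable to fill in values when reindexing. The method option does this: ffill fills forward, while bfill, used here, fills backward: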
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='bfill')
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
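To make the difference between the two fill directions concrete, here is a small sketch that reindexes the same Series both ways. The variable names forward and backward are illustrative only.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# ffill carries the last seen value forward into the new labels.
forward = obj3.reindex(range(6), method="ffill")
# bfill pulls the next available value backward; label 5 has no
# later value, so it stays missing.
backward = obj3.reindex(range(6), method="bfill")

print(forward.tolist())
print(backward.tolist())
```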
For a DataFrame, reindex can alter the (row) index, the columns, or both. When passed just a sequence, it reindexes the rows:
frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
frame2=frame.reindex(['a','b','c','d'])
frame2
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
Columns can be reindexed with the columns keyword:
states=['Texas','Utah','California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
Rows and columns can be reindexed in one call, but interpolation applies only along the rows (axis 0):
frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
b | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
With ix label indexing, the reindexing can be written more concisely:
frame.ix[['a','b','c','d'],states]
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
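Dropping entries from an axis
Dropping one or more entries from an axis is easy given an index array or list. The drop method returns a new object with the indicated values removed from the axis: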
obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['a','b'])
c 2.0
d 3.0
e 4.0
dtype: float64
With a DataFrame, index values can be deleted from either axis:
data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data.drop(['Colorado','Ohio'])
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data.drop('two',axis=1)
one | three | four | |
---|---|---|---|
Ohio | 0 | 2 | 3 |
Colorado | 4 | 6 | 7 |
Utah | 8 | 10 | 11 |
New York | 12 | 14 | 15 |
data.drop(['two','four'],axis=1)
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
Indexing, selection, and filtering
obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c 2.0
d 3.0
dtype: float64
obj[[1,3]]
b 1.0
d 3.0
dtype: float64
obj[obj<2]
a 0.0
b 1.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data[['three','one']]
three | one | |
---|---|---|
Ohio | 2 | 0 |
Colorado | 6 | 4 |
Utah | 10 | 8 |
New York | 14 | 12 |
data[:2]
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
data[data['three']>5]
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Indexing with a boolean DataFrame:
data<5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
data[data<5]=0
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
The ix indexing field selects a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels:
data.ix['Colorado',['two','three']]
two 5
three 6
Name: Colorado, dtype: int32
data.ix[['Colorado','Utah'],[3,0,1]]
four | one | two | |
---|---|---|---|
Colorado | 7 | 0 | 5 |
Utah | 11 | 8 | 9 |
data.ix[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
data.ix[:'Utah','two'] # the colon denotes a row range; a label slice includes its endpoint
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
data.ix[data.three>5,:3] # :3 is the column range
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
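The ix selections above mix labels and integer positions. A rough equivalent with loc and iloc is sketched below, assuming a pandas version where ix is unavailable. The data frame is rebuilt fresh, so its values differ from the modified frame shown in the notes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Labels on both axes.
row = data.loc["Colorado", ["two", "three"]]
# Row labels combined with column positions, translated to labels first.
block = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]
# Third row by position.
third = data.iloc[2]
# Label slices include the end label.
upto = data.loc[:"Utah", "two"]
# Boolean row mask plus the first three columns.
subset = data.loc[data["three"] > 5, data.columns[:3]]
print(block)
```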
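Arithmetic and data alignment
One of the most important pandas features is arithmetic between objects with different indexes. Labels that appear in only one operand yield NaN in the result: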
s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
s2=Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])
s1
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
s1+s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
For a DataFrame, alignment is performed on both the rows and the columns:
df1=DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
df1
b | c | d | |
---|---|---|---|
Ohio | 0.0 | 1.0 | 2.0 |
Texas | 3.0 | 4.0 | 5.0 |
Colorado | 6.0 | 7.0 | 8.0 |
df2
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
df1+df2
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
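Arithmetic methods with fill values
When an axis label is found in one object but not the other, a special fill value (such as 0) can be used in its place: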
df1=DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd'))
df2=DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde'))
df1
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
df1+df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | 11.0 | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
Use the add method on df1, passing df2 and a fill_value argument:
df1.add(df2,fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 11.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
A fill value can also be specified when reindexing a Series or DataFrame. Here df1 is extended with the columns of df2:
df1.reindex(columns=df2.columns,fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 | 0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 | 0 |
Operations between DataFrame and Series
Consider the difference between a two-dimensional array and one of its rows:
arr=np.arange(12.).reshape((3,4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([ 0., 1., 2., 3.])
arr-arr[0]
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
This is called broadcasting. Operations between a DataFrame and a Series work in much the same way:
frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
series=frame.ix[0]
frame
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows:
frame-series
b | d | e | |
---|---|---|---|
Utah | 0.0 | 0.0 | 0.0 |
Ohio | 3.0 | 3.0 | 3.0 |
Texas | 6.0 | 6.0 | 6.0 |
Oregon | 9.0 | 9.0 | 9.0 |
If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:
series2=Series(range(3),index=['b','e','f'])
series2
b 0
e 1
f 2
dtype: int32
frame+series2
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
series3=frame['d']
frame
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3,axis=0)
b | d | e | |
---|---|---|---|
Utah | -1.0 | 0.0 | 1.0 |
Ohio | -1.0 | 0.0 | 1.0 |
Texas | -1.0 | 0.0 | 1.0 |
Oregon | -1.0 | 0.0 | 1.0 |
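The two broadcasting directions can be compared side by side by centering a frame on its column means and on its row means. This is a minimal sketch; the names col_centered and row_centered are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# frame.mean() is indexed by column label, so the default behaviour
# matches on columns and broadcasts down the rows.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so axis=0 is needed to
# match on the row index and broadcast across the columns.
row_centered = frame.sub(frame.mean(axis=1), axis=0)
print(row_centered)
```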
Function application and mapping
NumPy ufuncs (element-wise array methods) also work on pandas objects:
frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.865812 | 0.216522 | 1.041688 |
Ohio | 0.923866 | -0.768960 | -0.638414 |
Texas | -1.305017 | 0.531479 | -3.574036 |
Oregon | -1.426827 | 0.114139 | 1.024634 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 0.865812 | 0.216522 | 1.041688 |
Ohio | 0.923866 | 0.768960 | 0.638414 |
Texas | 1.305017 | 0.531479 | 3.574036 |
Oregon | 1.426827 | 0.114139 | 1.024634 |
f=lambda x:x.max()-x.min()
frame.apply(f)
b 2.350693
d 1.300438
e 4.615724
dtype: float64
frame.apply(f,axis=1)
Utah 0.825166
Ohio 1.692826
Texas 4.105514
Oregon 2.451461
dtype: float64
def f(x):
    return Series([x.min(),x.max()],index=['min','max'])
frame.apply(f)
b | d | e | |
---|---|---|---|
min | -1.426827 | -0.768960 | -3.574036 |
max | 0.923866 | 0.531479 | 1.041688 |
Use applymap to apply an element-wise function, here producing a formatted string for each floating-point value in frame:
format=lambda x:'%.2f' % x
frame.applymap(format)
b | d | e | |
---|---|---|---|
Utah | 0.87 | 0.22 | 1.04 |
Ohio | 0.92 | -0.77 | -0.64 |
Texas | -1.31 | 0.53 | -3.57 |
Oregon | -1.43 | 0.11 | 1.02 |
frame['e'].map(format)
Utah 1.04
Ohio -0.64
Texas -3.57
Oregon 1.02
Name: e, dtype: object
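The three levels of function application (per column, per element of a frame, per element of a Series) can be sketched together as follows. Because newer pandas releases deprecate applymap, this sketch formats each element through apply combined with Series.map instead; the helper name min_max and the sample values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame([[0.8658, -0.2165], [-1.3050, 0.5315]],
                     columns=["b", "d"], index=["Utah", "Texas"])

def min_max(col):
    # Called once per column; returning a Series yields a DataFrame.
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)

fmt = lambda x: "%.2f" % x
# Element-wise formatting: map each column's values through fmt.
formatted = frame.apply(lambda col: col.map(fmt))
print(formatted)
```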
Sorting and ranking
Use the sort_index method to sort lexicographically by row or column index:
obj=Series(range(4),index=['d','a','b','c'])
obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int32
With a DataFrame, you can sort by the index on either axis:
frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
frame.sort_index()
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
frame.sort_index(axis=0)
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
frame.sort_index(axis=1)
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
Data is sorted in ascending order by default, but can be sorted in descending order too:
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
Use the sort_values method to sort a Series by its values (the older order method is deprecated):
obj=Series([4,7,-3,2])
obj.sort_values()
2 -3
3 2
0 4
1 7
dtype: int64
obj=Series([4,np.nan,7,np.nan,-3,2])
obj.sort_values()
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
On a DataFrame, pass one or more column names to the by option to sort by the values in those columns:
frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
frame
a | b | |
---|---|---|
0 | 0 | 4 |
1 | 1 | 7 |
2 | 0 | -3 |
3 | 1 | 2 |
frame.sort_values(by='b')
a | b | |
---|---|---|
2 | 0 | -3 |
3 | 1 | 2 |
0 | 0 | 4 |
1 | 1 | 7 |
To sort by multiple columns, pass a list of names:
frame.sort_values(by=['a','b'])
a | b | |
---|---|---|
2 | 0 | -3 |
0 | 0 | 4 |
3 | 1 | 2 |
1 | 1 | 7 |
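Sorting by several columns also accepts a per-column sort direction. The sketch below uses sort_values, the replacement the deprecation warnings point to; the mixed ascending list is an extra illustration, not something from the original notes.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by a single column.
by_b = frame.sort_values(by="b")
# Sort by 'a' ascending, then by 'b' descending within each 'a' group.
by_a_then_b = frame.sort_values(by=["a", "b"], ascending=[True, False])
print(by_a_then_b)
```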
Series and DataFrame have a rank method. By default, rank breaks ties by assigning each group the mean rank:
obj=Series([7,-5,7,4,2,0,4])
obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(ascending=False,method='max')
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
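The tie-breaking options are easiest to compare on the same data. The following sketch adds the dense method, which is not shown in the notes, next to the default average and first.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                # ties share the mean of their ranks
first = obj.rank(method="first")    # ties ranked in order of appearance
dense = obj.rank(method="dense")    # like 'min', but ranks increase by 1 between groups
print(pd.DataFrame({"average": average, "first": first, "dense": dense}))
```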
A DataFrame can compute ranks over the rows or the columns:
frame=DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
frame
a | b | c | |
---|---|---|---|
0 | 0 | 4.3 | -2.0 |
1 | 1 | 7.0 | 5.0 |
2 | 0 | -3.0 | 8.0 |
3 | 1 | 2.0 | -2.5 |
frame.rank(axis=1)
a | b | c | |
---|---|---|---|
0 | 2.0 | 3.0 | 1.0 |
1 | 1.0 | 3.0 | 2.0 |
2 | 2.0 | 1.0 | 3.0 |
3 | 2.0 | 3.0 | 1.0 |
Axis indexes with duplicate values
A Series with duplicate index values:
obj=Series(range(5),index=['a','a','b','b','c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int32
The is_unique property of the index tells you whether its values are unique:
obj.index.is_unique
False
Indexing a label that maps to multiple entries returns a Series, while a label with a single entry returns a scalar value:
obj['a']
a 0
a 1
dtype: int32
obj['c']
4
The same logic applies to indexing rows in a DataFrame:
df=DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | -1.143332 | 0.282357 | -0.751916 |
a | 0.201693 | -0.851545 | 0.915838 |
b | 0.542148 | -0.419044 | -0.540100 |
b | 1.162466 | -0.425086 | 0.366470 |
df.ix['b']
0 | 1 | 2 | |
---|---|---|---|
b | 0.542148 | -0.419044 | -0.54010 |
b | 1.162466 | -0.425086 | 0.36647 |
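Summarizing and computing descriptive statistics
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these are reductions or summary statistics, which extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent NumPy array methods, they are built to exclude missing data.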
df=DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
Calling the DataFrame sum method returns a Series containing column sums:
df.sum()
one 9.25
two -5.80
dtype: float64
Passing axis=1 sums over the rows instead:
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
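The all-NaN row c sums to 0.00 above because missing values are skipped by default. A short sketch of the options that change this behaviour follows; skipna and min_count are existing parameters of sum, while the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"], columns=["one", "two"])

skipped = df.sum(axis=1)                 # default: NaN values are ignored
strict = df.sum(axis=1, skipna=False)    # any NaN in a row makes the result NaN
needs_one = df.sum(axis=1, min_count=1)  # NaN only when a row has no valid values
print(pd.DataFrame({"skipped": skipped, "strict": strict, "needs_one": needs_one}))
```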
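Some methods, such as idxmin and idxmax, return indirect statistics like the index value where the minimum or maximum is attained: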
df.idxmax()
one b
two d
dtype: object
Other methods are accumulations:
df.cumsum()
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 8.50 | -4.5 |
c | NaN | NaN |
d | 9.25 | -5.8 |
describe produces multiple summary statistics in one shot:
df.describe()
C:\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3823: RuntimeWarning: Invalid value encountered in percentile RuntimeWarning)
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | NaN | NaN |
50% | NaN | NaN |
75% | NaN | NaN |
max | 7.100000 | -1.300000 |
On non-numeric data, describe produces alternative summary statistics:
obj=Series(['a','a','b','c']*4)
obj.describe()
count 16
unique 3
top a
freq 8
dtype: object
Correlation and covariance
from pandas_datareader import data as web  # pandas.io.data was moved to the separate pandas-datareader package
all_data = {}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.DataReader(ticker,'yahoo','1/1/2000','1/1/2010')
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})
The notes originally called all_data.iteritems() here, which raised AttributeError: 'dict' object has no attribute 'iteritems'. The cause is that dict.iteritems() was removed in Python 3; dict.items() is the replacement.
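The correlation and covariance methods this section is named for can be tried without any download. The sketch below substitutes synthetic random-walk prices for the stock data; the tickers AAA, BBB, CCC and all values are made up, so the numbers carry no financial meaning.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for downloaded prices (hypothetical tickers).
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-03", periods=250, freq="B")
price = pd.DataFrame(100 + rng.standard_normal((250, 3)).cumsum(axis=0),
                     index=dates, columns=["AAA", "BBB", "CCC"])

returns = price.pct_change().dropna()   # percent change between rows

pair_corr = returns["AAA"].corr(returns["BBB"])   # correlation of two Series
pair_cov = returns["AAA"].cov(returns["BBB"])     # covariance of two Series
corr_matrix = returns.corr()                      # full correlation matrix
cov_matrix = returns.cov()                        # full covariance matrix
corr_with_aaa = returns.corrwith(returns["AAA"])  # each column against one Series
print(corr_matrix)
```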
Unique values, value counts, and membership
obj=Series(['c','a','d','a','a','b','b','c','c'])
uniques=obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
value_counts computes the frequency of each value in a Series:
obj.value_counts()
c 3
a 3
b 2
d 1
dtype: int64
pd.value_counts(obj.values,sort=False)
a 3
d 1
b 2
c 3
dtype: int64
isin performs a vectorized set-membership check and can be used to filter a data set down to a subset of values in a Series or a DataFrame column:
mask=obj.isin(['b','c'])
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
obj[mask]
0 c
5 b
6 b
7 c
8 c
dtype: object
data=DataFrame({'Qu1':[1,3,4,3,4],
'Qu2':[2,3,1,2,3],
'Qu3':[1,5,2,4,4]})
data
Qu1 | Qu2 | Qu3 | |
---|---|---|---|
0 | 1 | 2 | 1 |
1 | 3 | 3 | 5 |
2 | 4 | 1 | 2 |
3 | 3 | 2 | 4 |
4 | 4 | 3 | 4 |
result=data.apply(pd.value_counts).fillna(0)
result
Qu1 | Qu2 | Qu3 | |
---|---|---|---|
1 | 1.0 | 1.0 | 1.0 |
2 | 0.0 | 2.0 | 1.0 |
3 | 2.0 | 2.0 | 0.0 |
4 | 2.0 | 0.0 | 2.0 |
5 | 0.0 | 0.0 | 1.0 |
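The per-column histogram above can also be computed with the Series.value_counts method rather than the top-level pd.value_counts function, which newer pandas releases deprecate. A minimal sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Count values column by column; values missing from a column become
# NaN after alignment and are replaced by 0.
result = data.apply(lambda col: col.value_counts()).fillna(0).sort_index()
print(result)
```

The row labels of result are the distinct values found across all columns, and each cell is the number of times that value occurs in the column.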
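Handling missing data
pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is just a sentinel that can be easily detected: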
string_data=Series(['aardvark','artichoke',np.nan,'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
string_data[0]=None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
Filtering out missing data
dropna filters out missing data. On a Series, it returns a Series containing only the non-null data and index values:
from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, dropna by default drops any row containing a missing value:
data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
cleaned=data.dropna()
data
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
cleaned
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
data.dropna(how='all')
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
data[4]=NA
data
0 | 1 | 2 | 4 | |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1,how='all')
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
df=DataFrame(np.random.randn(7,3))
df.ix[:4,1]=NA
df.ix[:2,2]=NA
df
0 | 1 | 2 | |
---|---|---|---|
0 | -0.625413 | NaN | NaN |
1 | -0.149081 | NaN | NaN |
2 | 0.508126 | NaN | NaN |
3 | 1.302235 | NaN | 0.779230 |
4 | 0.314148 | NaN | -0.354226 |
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
df.dropna(thresh=3)
0 | 1 | 2 | |
---|---|---|---|
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
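The dropna options are easier to compare on a small deterministic frame than on random data. The following sketch uses made-up values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame([[1.0, NA, NA],
                   [2.0, NA, 3.0],
                   [4.0, 5.0, 6.0],
                   [NA, NA, NA]])

any_na = df.dropna()                  # drop rows with at least one NaN
all_na = df.dropna(how="all")         # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)    # keep rows with two or more non-NaN values
print(at_least_two)
```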
Filling in missing data
Calling fillna with a constant replaces missing values with that value:
df.fillna(0)
0 | 1 | 2 | |
---|---|---|---|
0 | -0.625413 | 0.000000 | 0.000000 |
1 | -0.149081 | 0.000000 | 0.000000 |
2 | 0.508126 | 0.000000 | 0.000000 |
3 | 1.302235 | 0.000000 | 0.779230 |
4 | 0.314148 | 0.000000 | -0.354226 |
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
Calling fillna with a dict lets you use a different fill value for each column:
df.fillna({1:0.5,3:-1})
0 | 1 | 2 | |
---|---|---|---|
0 | -0.625413 | 0.500000 | NaN |
1 | -0.149081 | 0.500000 | NaN |
2 | 0.508126 | 0.500000 | NaN |
3 | 1.302235 | 0.500000 | 0.779230 |
4 | 0.314148 | 0.500000 | -0.354226 |
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
df
0 | 1 | 2 | |
---|---|---|---|
0 | -0.625413 | NaN | NaN |
1 | -0.149081 | NaN | NaN |
2 | 0.508126 | NaN | NaN |
3 | 1.302235 | NaN | 0.779230 |
4 | 0.314148 | NaN | -0.354226 |
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
# fillna returns a new object by default; inplace=True modifies the existing object instead
_=df.fillna(0,inplace=True)
df
0 | 1 | 2 | |
---|---|---|---|
0 | -0.625413 | 0.000000 | 0.000000 |
1 | -0.149081 | 0.000000 | 0.000000 |
2 | 0.508126 | 0.000000 | 0.000000 |
3 | 1.302235 | 0.000000 | 0.779230 |
4 | 0.314148 | 0.000000 | -0.354226 |
5 | -2.301367 | -0.016120 | -0.350993 |
6 | -0.810814 | -1.352287 | 1.630001 |
The same interpolation methods available for reindexing can be used with fillna:
df=DataFrame(np.random.randn(6,3))
df.ix[2:,1]=NA
df.ix[4:,2]=NA
df
0 | 1 | 2 | |
---|---|---|---|
0 | 0.795837 | 1.038840 | 1.805921 |
1 | -0.990684 | -0.542918 | 1.262955 |
2 | 0.215151 | NaN | 1.680489 |
3 | -1.881573 | NaN | -0.301868 |
4 | 0.491152 | NaN | NaN |
5 | 0.258699 | NaN | NaN |
df.fillna(method='ffill')
0 | 1 | 2 | |
---|---|---|---|
0 | 0.795837 | 1.038840 | 1.805921 |
1 | -0.990684 | -0.542918 | 1.262955 |
2 | 0.215151 | -0.542918 | 1.680489 |
3 | -1.881573 | -0.542918 | -0.301868 |
4 | 0.491152 | -0.542918 | -0.301868 |
5 | 0.258699 | -0.542918 | -0.301868 |
df.fillna(method='ffill',limit=2)
0 | 1 | 2 | |
---|---|---|---|
0 | 0.795837 | 1.038840 | 1.805921 |
1 | -0.990684 | -0.542918 | 1.262955 |
2 | 0.215151 | -0.542918 | 1.680489 |
3 | -1.881573 | -0.542918 | -0.301868 |
4 | 0.491152 | NaN | -0.301868 |
5 | 0.258699 | NaN | -0.301868 |
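Forward filling with a limit can also be written with the dedicated ffill method, which may be preferable on newer pandas releases that deprecate the method argument of fillna. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame({"x": [1.0, NA, NA, NA],
                   "y": [2.0, 3.0, NA, NA]})

filled = df.ffill()           # propagate the last valid value down each column
limited = df.ffill(limit=2)   # fill at most two consecutive NaN values
print(limited)
```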
With fillna you can also pass the mean or median of a Series:
data=Series([1.,NA,3.5,NA,7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.mean()
3.8333333333333335
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
Hierarchical indexing
Hierarchical indexing lets you have multiple (two or more) index levels on an axis.
data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
data
a 1 0.091014
2 1.542964
3 -0.287869
b 1 1.551622
2 -2.931760
3 0.751749
c 1 1.660620
2 -1.493720
d 2 0.718965
3 0.826192
dtype: float64
This is the formatted output of a Series with a MultiIndex as its index:
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
data['b']
1 1.551622
2 -2.931760
3 0.751749
dtype: float64
data['b':'c']
b 1 1.551622
2 -2.931760
3 0.751749
c 1 1.660620
2 -1.493720
dtype: float64
data.ix[['b','d']]
b 1 1.551622
2 -2.931760
3 0.751749
d 2 0.718965
3 0.826192
dtype: float64
Selection is also possible from an inner level:
data[:,2]
a 1.542964
b -2.931760
c -1.493720
d 0.718965
dtype: float64
This data can be rearranged into a DataFrame using its unstack method:
data.unstack()
1 | 2 | 3 | |
---|---|---|---|
a | 0.091014 | 1.542964 | -0.287869 |
b | 1.551622 | -2.931760 | 0.751749 |
c | 1.660620 | -1.493720 | NaN |
d | NaN | 0.718965 | 0.826192 |
The inverse operation of unstack is stack:
data.unstack().stack()
a 1 0.091014
2 1.542964
3 -0.287869
b 1 1.551622
2 -2.931760
3 0.751749
c 1 1.660620
2 -1.493720
d 2 0.718965
3 0.826192
dtype: float64
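The unstack and stack round trip can be checked on a small deterministic Series. In this sketch an explicit dropna() is applied after stack, as a precaution for pandas versions in which stack keeps the NaN cells introduced by unstack.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                 index=[["a", "a", "b", "b", "c"], [1, 2, 1, 2, 2]])

wide = data.unstack()          # inner level becomes the columns
back = wide.stack().dropna()   # columns return to the inner index level
print(wide)
```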
With a DataFrame, either axis can have a hierarchical index:
frame=DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
frame
 |  | Ohio | Ohio | Colorado |
---|---|---|---|---|
 |  | Green | Red | Green |
a | 1 | 0 | 1 | 2 |
a | 2 | 3 | 4 | 5 |
b | 1 | 6 | 7 | 8 |
b | 2 | 9 | 10 | 11 |
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
state |  | Ohio | Ohio | Colorado |
---|---|---|---|---|
color |  | Green | Red | Green |
key1 | key2 |  |  |  |
a | 1 | 0 | 1 | 2 |
a | 2 | 3 | 4 | 5 |
b | 1 | 6 | 7 | 8 |
b | 2 | 9 | 10 | 11 |
frame['Ohio']
color |  | Green | Red |
---|---|---|---|
key1 | key2 |  |  |
a | 1 | 0 | 1 |
a | 2 | 3 | 4 |
b | 1 | 6 | 7 |
b | 2 | 9 | 10 |
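Reordering and sorting levels
swaplevel takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):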
frame.swaplevel('key1','key2')
state |  | Ohio | Ohio | Colorado |
---|---|---|---|---|
color |  | Green | Red | Green |
key2 | key1 |  |  |  |
1 | a | 0 | 1 | 2 |
2 | a | 3 | 4 | 5 |
1 | b | 6 | 7 | 8 |
2 | b | 9 | 10 | 11 |
sortlevel sorts the data (stably) using only the values in a single level. When swapping levels, it is common to also use sortlevel so that the result is sorted:
frame.sortlevel(1)
state |  | Ohio | Ohio | Colorado |
---|---|---|---|---|
color |  | Green | Red | Green |
key1 | key2 |  |  |  |
a | 1 | 0 | 1 | 2 |
b | 1 | 6 | 7 | 8 |
a | 2 | 3 | 4 | 5 |
b | 2 | 9 | 10 | 11 |
frame.swaplevel(0,1).sortlevel(0)
state |  | Ohio | Ohio | Colorado |
---|---|---|---|---|
color |  | Green | Red | Green |
key2 | key1 |  |  |  |
1 | a | 0 | 1 | 2 |
1 | b | 6 | 7 | 8 |
2 | a | 3 | 4 | 5 |
2 | b | 9 | 10 | 11 |
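Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series have a level option, which specifies the level to aggregate by on a particular axis: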
frame.sum(level='key2')
state | Ohio | Ohio | Colorado |
---|---|---|---|
color | Green | Red | Green |
key2 |  |  |  |
1 | 6 | 8 | 10 |
2 | 12 | 14 | 16 |
frame.sum(level='color',axis=1)
color |  | Green | Red |
---|---|---|---|
key1 | key2 |  |  |
a | 1 | 2 | 1 |
a | 2 | 8 | 4 |
b | 1 | 14 | 7 |
b | 2 | 20 | 10 |
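The same level-wise sums can be expressed through groupby, which is the route to take on newer pandas releases where the level argument of sum is no longer available. The sketch below rebuilds the frame from the notes; transposing before and after is one way to group along the columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Sum the rows that share a key2 value.
by_key2 = frame.groupby(level="key2").sum()
# Sum the columns that share a color: transpose, group the rows, transpose back.
by_color = frame.T.groupby(level="color").sum().T
print(by_color)
```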
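Using a DataFrame's columns
It is common to want to use one or more columns of a DataFrame as the row index, or alternatively to move the row index into the DataFrame's columns: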
frame=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
frame
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 7 | one | 0 |
1 | 1 | 6 | one | 1 |
2 | 2 | 5 | one | 2 |
3 | 3 | 4 | two | 0 |
4 | 4 | 3 | two | 1 |
5 | 5 | 2 | two | 2 |
6 | 6 | 1 | two | 3 |
DataFrame's set_index function creates a new DataFrame using one or more of its columns as the index:
frame2=frame.set_index(['c','d'])
frame2
 |  | a | b |
---|---|---|---|
c | d |  |  |
one | 0 | 0 | 7 |
one | 1 | 1 | 6 |
one | 2 | 2 | 5 |
two | 0 | 3 | 4 |
two | 1 | 4 | 3 |
two | 2 | 5 | 2 |
two | 3 | 6 | 1 |
frame.set_index(['c','d'],drop=False)
 |  | a | b | c | d |
---|---|---|---|---|---|
c | d |  |  |  |  |
one | 0 | 0 | 7 | one | 0 |
one | 1 | 1 | 6 | one | 1 |
one | 2 | 2 | 5 | one | 2 |
two | 0 | 3 | 4 | two | 0 |
two | 1 | 4 | 3 | two | 1 |
two | 2 | 5 | 2 | two | 2 |
two | 3 | 6 | 1 | two | 3 |
reset_index does the opposite of set_index: the hierarchical index levels are moved into the columns:
frame2.reset_index()
c | d | a | b | |
---|---|---|---|---|
0 | one | 0 | 0 | 7 |
1 | one | 1 | 1 | 6 |
2 | one | 2 | 2 | 5 |
3 | two | 0 | 3 | 4 |
4 | two | 1 | 4 | 3 |
5 | two | 2 | 5 | 2 |
6 | two | 3 | 6 | 1 |
Other pandas topics
Integer indexing
For reliable position-based indexing regardless of the index type, use the iloc indexer on a Series or DataFrame (it replaces the deprecated iget_value, irow, and icol methods):
ser3=Series(range(3),index=[-5,1,3])
ser3.iloc[2]
2
frame=DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])
frame
0 | 1 | |
---|---|---|
2 | 0 | 1 |
0 | 2 | 3 |
1 | 4 | 5 |
frame.iloc[0]
0 0
1 1
Name: 2, dtype: int32
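Panel data
pandas has a Panel data structure, which can be thought of as a three-dimensional analogue of DataFrame. A Panel can be created from a dict of DataFrame objects or from a three-dimensional ndarray: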
from pandas_datareader import data as web  # pandas.io.data was moved to the separate pandas-datareader package
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 14 (major_axis) x 0 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: </html> to <script language=javascript type="text/javascript">
Minor_axis axis: None
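This call consistently fails to return usable data: the resulting Panel has an empty minor axis, and its major axis contains fragments of an HTML page instead of dates.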
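Three-dimensional data of this kind can also be held in a DataFrame with a hierarchical index, which is an option on pandas releases that no longer ship Panel. The sketch below is an assumption-laden illustration: the tickers AAA and BBB, the column names, and all values are made up, and no data is downloaded.

```python
import pandas as pd

dates = pd.date_range("2010-01-04", periods=2, freq="B")

# Hypothetical per-ticker frames standing in for downloaded data.
frames = {
    "AAA": pd.DataFrame({"Open": [1.0, 2.0], "Close": [1.5, 2.5]}, index=dates),
    "BBB": pd.DataFrame({"Open": [10.0, 11.0], "Close": [10.5, 11.5]}, index=dates),
}

# The dict keys become the outer index level, the dates the inner level.
stacked = pd.concat(frames, names=["item", "date"])

one_item = stacked.loc["AAA"]                      # all dates for one ticker
one_date = stacked.xs(dates[0], level="date")      # all tickers for one date
print(stacked)
```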