问题导读:
1.pandas的数据结构介绍
2.利用pandas对数据的基本操作
3.汇总和计算描述统计方法
4.处理缺失数据
5.层次化索引
解决方案:
pandas的数据结构介绍:
(1)Series
- Series 是一种类似于一维数组的对象,它由一组数据(各种Numpy数据类型)以及与之相关的数据标签(索引)组成
In [6]: obj = pd.Series([1,2,3,4])
In [7]: obj
Out[7]:
0 1
1 2
2 3
3 4
dtype: int64
- Series 字符串的表现形式为:索引在左边,值在右边
In [8]: obj.values
Out[8]: array([1, 2, 3, 4])
In [9]: obj.index
Out[9]: Int64Index([0, 1, 2, 3], dtype='int64')
- 创建的Series 带有一个可以对各个数据点进行标记的索引
In [10]: obj2 = pd.Series([2,4,1,6],index=['a','b','c','d'])
In [11]: obj2
Out[11]:
a 2
b 4
c 1
d 6
dtype: int64
In [12]: obj2.c
Out[12]: 1
- 计算(根据bool型数据过滤、标量乘法、应用数学函数)之后index 和 values 依然保持连接
In [18]: obj2[obj2 > 0]
Out[18]:
a 2
b 4
c 1
d 6
dtype: int64
In [19]: obj2 * 2
Out[19]:
a 4
b 8
c 2
d 12
dtype: int64
In [20]: np.exp(obj2)
Out[20]:
a 7.389056
b 54.598150
c 2.718282
d 403.428793
dtype: float64
- 通过python字典直接创建Series
In [21]: sdata = {'Ohio':2000, 'Texas':3000, 'Utah':3425, 'Oregon':3908}
In [22]: obj3 = pd.Series(sdata)
In [23]: obj3
Out[23]:
Ohio 2000
Oregon 3908
Texas 3000
Utah 3425
dtype: int64
In [24]: obj3.values
Out[24]: array([2000, 3908, 3000, 3425])
In [25]: obj3.index
Out[25]: Index([u'Ohio', u'Oregon', u'Texas', u'Utah'], dtype='object')
- 如果只传入一个字典,则结果中的索引就是按照原字典的键(有序排列)
- 如果传入的index 列表中的元素和字典中的键相互匹配不到,则该索引位置的values 为NaN
n [26]: states = ['california','Ohio','Oregon','Texas']
In [27]: obj4 = pd.Series(sdata,index = states)
In [28]: obj4
Out[28]:
california NaN
Ohio 2000
Oregon 3908
Texas 3000
dtype: float64
- pandas 的isnull 和notnull 函数可以用于检测缺失数据
In [34]: obj4.isnull()
Out[34]:
california True
Ohio False
Oregon False
Texas False
dtype: bool
In [35]: obj4.notnull()
Out[35]:
california False
Ohio True
Oregon True
Texas True
dtype: bool
- Series 一个重要的功能是:它在运算中会自动对齐不同索引的数据
In [36]: obj3
Out[36]:
Ohio 2000
Oregon 3908
Texas 3000
Utah 3425
dtype: int64
In [37]: obj4
Out[37]:
california NaN
Ohio 2000
Oregon 3908
Texas 3000
dtype: float64
In [38]: obj3 + obj4
Out[38]:
Ohio 4000
Oregon 7816
Texas 6000
Utah NaN
california NaN
dtype: float64
- Series 对象本身 及其索引都有一个name 属性
- Series 的索引可以通过赋值的方式就地修改
In [40]: obj4.name = 'population'
In [41]: obj4.index.name = 'state'
In [42]: obj4
Out[42]:
state
california NaN
Ohio 2000
Oregon 3908
Texas 3000
Name: population, dtype: float64
In [43]: obj
Out[43]:
0 1
1 2
2 3
3 4
dtype: int64
In [45]: obj.index = ['a','b','c','d']
In [46]: obj
Out[46]:
a 1
b 2
c 3
d 4
dtype: int64
(2)DataFrame
- DataFrame 是一个表格型的数据结构,每列可以是不同的值类型(数值,字符串,布尔值)。
- DataFrame 可以被看做由Series组成的字典(共用同一个索引)
- 通过字典创建DataFrame:每一columns对应的是字典中每个key - value 对,有index 行
In [16]: data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year':[2000, 2001, 2002, 2003, 2004],
'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
In [17]: data
Out[17]:
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2003, 2004]}
In [18]: frame01 = pd.DataFrame(data,columns = ['year','state','pop'],index = ['1','2','3','4','5'])
In [19]: frame01
Out[19]:
year state pop
1 2000 Ohio 1.5
2 2001 Ohio 1.7
3 2002 Ohio 3.6
4 2003 Nevada 2.4
5 2004 Nevada 2.9
- 如何指定了列序列,则DataFrame的列就会按照制定顺序进行排序
- 和Series 一样,如果传入的列在数据中找不到,就会产生NA值
- 获取列,获得的是一个series
In [31]: frame01['year']
Out[31]:
1 2000
2 2001
3 2002
4 2003
5 2004
Name: year, dtype: int64
In [32]: frame01['pop']
Out[32]:
1 1.5
2 1.7
3 3.6
4 2.4
5 2.9
Name: pop, dtype: float64
In [33]: frame01.state
Out[33]:
1 Ohio
2 Ohio
3 Ohio
4 Nevada
5 Nevada
Name: state, dtype: object
- 行可以通过索引字段ix获取
In [5]: frame01.ix[1]
Out[5]:
year 2001
state Ohio
pop 1.7
Name: 2, dtype: object
In [6]: frame01.ix['1']
Out[6]:
year 2000
state Ohio
pop 1.5
Name: 1, dtype: object
- 列的索引依然为DataFrame 的索引
In [35]: frame02
Out[35]:
state year pop
one Ohio 2000 1.5
two Ohio 2001 1.7
three Ohio 2002 3.6
four Nevada 2003 2.4
five Nevada 2004 2.9
In [36]: frame02.state
Out[36]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
In [37]: frame02.year
Out[37]:
one 2000
two 2001
three 2002
four 2003
five 2004
Name: year, dtype: int64
- 修改元素
- 列可以通过赋值的方式进行修改
- 将列表或数组赋值给某个列时,其长度必须和DataFrame的长度相匹配
- 如果赋值的是一个Series,就会精确匹配DataFrame的索引,所有的空位都将被填上缺失值
In [22]: frame
Out[22]:
state year pop
one Ohio 1 1
two Ohio 1 1
three Ohio 1 1
four Nevada 1 1
five Nevada 1 1
In [23]: frame.year = np.arange(2011,2016)
In [24]: frame.pop = np.random.randn(5)
In [25]: frame
Out[25]:
state year pop
one Ohio 2011 0.968678
two Ohio 2012 0.280024
three Ohio 2013 -0.396301
four Nevada 2014 -0.132369
five Nevada 2015 -0.181226
In [40]: frame
Out[40]:
state year pop debt
one Ohio 1 1 NaN
two Ohio 1 1 NaN
three Ohio 1 1 NaN
four Nevada 1 1 NaN
five Nevada 1 1 NaN
In [41]: val = pd.Series([-1.2,-1.5,-1.7], index = ['two', 'four', 'five'])
In [42]: frame['dedt'] = val
In [43]: frame
Out[43]:
state year pop debt dedt
one Ohio 1 1 NaN NaN
two Ohio 1 1 NaN -1.2
three Ohio 1 1 NaN NaN
four Nevada 1 1 NaN -1.5
five Nevada 1 1 NaN -1.7
- 删除列
In [44]: frame['newline'] = frame.state != 'Ohio'
In [45]: frame
Out[45]:
state year pop debt dedt newline
one Ohio 1 1 NaN NaN False
two Ohio 1 1 NaN -1.2 False
three Ohio 1 1 NaN NaN False
four Nevada 1 1 NaN -1.5 True
five Nevada 1 1 NaN -1.7 True
In [46]: del frame['newline']
In [47]: frame
Out[47]:
state year pop debt dedt
one Ohio 1 1 NaN NaN
two Ohio 1 1 NaN -1.2
three Ohio 1 1 NaN NaN
four Nevada 1 1 NaN -1.5
five Nevada 1 1 NaN -1.7
- 补充说明
In [69]: frame.columns.name = 'state'
In [70]: frame.index.name = 'myname'
In [71]: frame.columns.name = 'state'
In [72]: frame
Out[72]:
state state year pop debt dedt
myname
one Ohio 1 1 NaN NaN
two Ohio 1 1 NaN -1.2
three Ohio 1 1 NaN NaN
four Nevada 1 1 NaN -1.5
five Nevada 1 1 NaN -1.7
In [73]: frame.values
Out[73]:
array([['Ohio', '1', '1', nan, nan],
['Ohio', '1', '1', nan, -1.2],
['Ohio', '1', '1', nan, nan],
['Nevada', '1', '1', nan, -1.5],
['Nevada', '1', '1', nan, -1.7]], dtype=object)
(3)可以输入给DataFrame构造器的数据
(3)索引对象
- pandas 的索引对象负责管理轴标签和其他元数据(比如轴名称)
n [4]: obj = pd.Series(range(3),index=['a','b','c'])
In [5]: obj
Out[5]:
a 0
b 1
c 2
dtype: int64
In [6]: index = obj.index
In [7]: index
Out[7]: Index([u'a', u'b', u'c'], dtype='object')
In [8]: index[1:]
Out[8]: Index([u'b', u'c'], dtype='object')
In [9]: index[
- Index 对象是不可以修改的
- 不可修改性很重要,因为这样才能使Index对象在多个数据结构之间安全共享
n [4]: obj = pd.Series(range(3),index=['a','b','c'])
In [5]: obj
Out[5]:
a 0
b 1
c 2
dtype: int64
In [6]: index = obj.index
In [7]: index
Out[7]: Index([u'a', u'b', u'c'], dtype='object')
In [8]: index[1:]
Out[8]: Index([u'b', u'c'], dtype='object')
In [42]: index
Out[42]: Index([u'a', u'b', u'c'], dtype='object')
In [43]: obj2 = pd.Series(range(3),index = index)
In [44]: obj2.index
obj2.index
In [44]: obj2.index is index
Out[44]: True
In [45]: index
Out[45]: Index([u'a', u'b', u'c'], dtype='object')
In [46]: obj2 = pd.Series(range(3),index = index)
In [47]: obj2.index is index
Out[47]: True
- Python中的对象包含三要素:id、type、value
is判断的是a对象是否就是b对象,是通过id来判断的
==判断的是a对象的值是否和b对象的值相等,是通过value来判断的
In [48]: index2=index
In [49]: index2
Out[49]: Index([u'a', u'b', u'c'], dtype='object')
In [50]: obj2.index is index2
Out[50]: True
In [51]: index3 = ['a','b','c']
In [52]: index3 is index
Out[52]: False
In [53]: obj2.index is index3
Out[53]: False
- pandas中主要的Index对象
- Index的功能也类似一个固定大小的集合
In [28]: frame
Out[28]:
Nevada Ohio
2000 2.2 1.5
2001 2.1 1.7
2002 2.1 3.6
In [29]: 'Ohio' in frame.columns
Out[29]: True
In [30]: 2002 in frame.index
Out[30]: True
In [31]: 1999 in frame.index
Out[31]: False
- Index的方法和属性
pandas数据处理基本方法:
(1)重新索引
- pandas对象的一个重要的方法是reindex,其作用是创建一个适应新索引的新对象
- 调用Series 的 reindex 将会根据新索引进行重排,如果某个索引值当前不存在,就引入缺失值
In [38]: obj = pd.Series([4.5,7.2,-5.3,3.8], index = ['d','b','a','c'])
In [39]: obj
Out[39]:
d 4.5
b 7.2
a -5.3
c 3.8
dtype: float64
In [40]: obj2 = obj.reindex(['a','b','c','d','e'])
In [41]: obj2
Out[41]:
a -5.3
b 7.2
c 3.8
d 4.5
e NaN
dtype: float64
- 为了防止索引值不存在的情况可以用 fill_value 值来设置无索引的默认值
In [42]: obj2 = obj.reindex(['a','b','c','d','e'],fill_value = 0)
In [43]: obj2
Out[43]:
a -5.3
b 7.2
c 3.8
d 4.5
e 0.0
dtype: float64
- 对于时间序列这样的有序数据,重新索引时可以需要做一些插值操作处理,method选项即可达到此目的
- ffill可以实现向前值填充
In [44]: obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
In [45]: obj3.reindex(range(6),method='ffill')
Out[45]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
- reindex的(插值)method选项
- 对于DataFrame,reindex可以修改(行)索引,列,或两个都修改
- 如果仅仅传入一个序列,则会重新索引行
In [47]: frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
....: columns=['Ohio','Texas','California'])
In [48]: frame
Out[48]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [49]: frame2 = frame.reindex(['a','b','c','d'])
In [50]: frame2
Out[50]:
Ohio Texas California
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
In [51]: states = ['Texas','Utah','California']
In [52]: frame.reindex(columns=states)
Out[52]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
- 对行和列同时进行索引,而插值则只能按行应用(即轴0)
In [54]: frame.reindex(index=['a','b','c','d'],method = 'ffill',
....: columns = states)
Out[54]:
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
- 利用ix的标签索引功能索引
In [56]: frame.ix[['a','c'],'Texas']
Out[56]:
a 1
c 4
Name: Texas, dtype: int64
- reindex函数的参数
(2)丢弃指定轴上的项
- drop方法返回的是一个在指定轴上删除了指定值的新对象
In [58]: obj = pd.Series(np.arange(5.),index=['a','b','c','d','e'])
In [59]: new_obj = obj.drop('c')
In [60]: obj
Out[60]:
a 0
b 1
c 2
d 3
e 4
dtype: float64
In [61]: new_obj
Out[61]:
a 0
b 1
d 3
e 4
dtype: float64
- DataFrame可以删除任意轴上的索引值
In [68]: data = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
In [69]: data
Out[69]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [70]: data.drop(['Colorado','Ohio'])
Out[70]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [71]: data.drop('two',axis=1)
Out[71]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [72]: data.drop(['one','four'],axis=1)
Out[72]:
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14
(3)索引、选取和过滤
- Series 索引工作方式类似与numpy数组的索引,只不过Series的索引值不只是整数
In [3]: obj = pd.Series(np.arange(4.),index=['a','b','c','d'])
In [4]: obj
Out[4]:
a 0
b 1
c 2
d 3
dtype: float64
In [5]: obj['b']
Out[5]: 1.0
In [6]: obj[1]
Out[6]: 1.0
In [7]: obj[2:4]
Out[7]:
c 2
d 3
dtype: float64
In [8]: obj[['b','d']]
Out[8]:
b 1
d 3
dtype: float64
In [9]: obj[['1','3']]
Out[9]:
1 NaN
3 NaN
dtype: float64
In [10]: obj[[1,3]]
Out[10]:
b 1
d 3
dtype: float64
In [11]: obj[obj<2]
Out[11]:
a 0
b 1
dtype: float64
- 利用标签的切片运算与普通的python切片运算不同,其末端是包含的,而普通的切片不包含
In [18]: obj[0:3]
Out[18]:
a 0
b 1
c 2
dtype: float64
In [19]: obj['a':'d']
Out[19]:
a 0
b 1
c 2
d 3
dtype: float64
- 切片赋值
In [21]: obj['a':'c'] = 1
In [22]: obj
Out[22]:
a 1
b 1
c 1
d 3
dtype: float64
- 对DataFrame进行索引就是获取一个或多个列
- 索引方式有几个特殊的情况。首先通过切片或布尔型数组选取行
In [24]: data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],
....: columns=['one','two','three','four'])
In [25]: data
Out[25]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[4 rows x 4 columns]
In [26]: data[:2]
Out[26]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
[2 rows x 4 columns]
In [27]: data[data['three'] > 5]
Out[27]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[3 rows x 4 columns]
- 上面那种方式比较不符合逻辑,毕竟来源于实践
- 另一种方式可以通过布尔型DataFrame来进行索引
In [38]: data < 5
Out[38]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
[4 rows x 4 columns]
In [39]: data[data < 5] = 0
In [40]: data
Out[40]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
[4 rows x 4 columns]
- 使用ix专门的索引字段
In [44]: data.ix['Colorado',['tow','three']]
Out[44]:
tow NaN
three 6
Name: Colorado, dtype: float64
In [45]: data.ix['Colorado',['two','three']]
Out[45]:
two 5
three 6
Name: Colorado, dtype: int64
In [46]: data.ix[['Colorado','Utah'],[3,1,0]]
Out[46]:
four two one
Colorado 7 5 0
Utah 11 9 8
[2 rows x 3 columns]
In [47]: data.ix[2]
Out[47]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
In [48]: data.ix[:'Utah','two']
Out[48]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
In [49]: data.ix[data.three > 5,:3]
Out[49]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
[3 rows x 3 columns]
- DataFrame的索引选项:
(4)算数运算和数据对齐
- pandas最重要的一个功能是可以对不同索引值的对象进行算数运算。
- 如果存在不同的索引对,则结果的索引就是该索引对的并集
In [51]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index = ['a','c','d','e'])
In [52]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a','c','e','f','g'])
In [53]: s1
Out[53]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
In [54]: s2
Out[54]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
In [55]: s1 + s2
Out[55]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
- 对于DataFrame,对齐操作会同时发生在行和列上
In [61]: df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),
index = ['Ohio','Texas','Colorado'])
In [62]: df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),
index = ['Utah','Ohio','Texas','Oregon'])
In [63]: df1
Out[63]:
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
[3 rows x 3 columns]
In [64]: df2
Out[64]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
[4 rows x 3 columns]
In [65]: df1 + df2
Out[65]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3 NaN 6 NaN
Oregon NaN NaN NaN NaN
Texas 9 NaN 12 NaN
Utah NaN NaN NaN NaN
[5 rows x 4 columns]
4.1 在算数方法中填充值
- 在对不同索引对象进行算数运算时,你可能希望当一个对象中某个轴标签在另一个对象中找不到时填充一个特殊值
In [66]: df1 = pd.DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd'))
In [67]: df2 = pd.DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde'))
In [68]: df1 + df2
Out[68]:
a b c d e
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN
[4 rows x 5 columns]
In [69]: df1.add(df2,fill_value=0)
Out[69]:
a b c d e
0 0 2 4 6 4
1 9 11 13 15 9
2 18 20 22 24 14
3 15 16 17 18 19
[4 rows x 5 columns]
- 与此类似,Series 或 DataFrame 重新索引时,也可以指定一个填充值
In [72]: df1.reindex(columns=df2.columns,fill_value=0)
Out[72]:
a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0
[3 rows x 5 columns]
- 算术方法
4.2 DataFrame和Series之间的运算
- 先来计算一个二维数组与其某行之间的差
- 这样的运算叫做广播
In [73]: arr = np.arange(12.).reshape((3,4))
In [74]: arr
Out[74]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
In [75]: arr[0]
Out[75]: array([ 0., 1., 2., 3.])
In [76]: array([0.,1.,2.,3.])
Out[76]: array([ 0., 1., 2., 3.])
In [77]: arr - arr[0]
Out[77]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
- DataFrame与Series 之间的运算和上面差不多
- 默认情况下,DataFrame 和 Series 之间的算术运算会将Series的索引匹配到DataFrame的列,然后沿着行向下广播
In [78]: frame = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
In [79]: series = frame.ix[0]
In [80]: frame
Out[80]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
[4 rows x 3 columns]
In [81]: series
Out[81]:
b 0
d 1
e 2
Name: Utah, dtype: float64
In [82]: frame - series
Out[82]:
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
[4 rows x 3 columns]
- 如果某个索引值在DataFrame的列或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集
In [89]: series2 = pd.Series(range(3),index=['b','e','f'])
In [90]: frame + series2
Out[90]:
b d e f
Utah 0 NaN 3 NaN
Ohio 3 NaN 6 NaN
Texas 6 NaN 9 NaN
Oregon 9 NaN 12 NaN
[4 rows x 4 columns]
- 如果你希望匹配行且在列上广播,则必须使用算术运算方法
In [98]: series3 = frame['d']
In [99]: frame
Out[99]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
[4 rows x 3 columns]
In [100]: series3
Out[100]:
Utah 1
Ohio 4
Texas 7
Oregon 10
Name: d, dtype: float64
In [101]: frame.sub(series3,axis=0)
Out[101]:
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
[4 rows x 3 columns]
(5)函数应用和映射
- NumPy 的ufuncs(元素级数组方法)也可用于操作pandas对象
- abs() 函数返回数字的绝对值
<span style="font-family:Microsoft Yahei, Helvetica Neue, Helvetica, Arial, sans-serif;color:#333333;"><span style="line-height: 24px;">In [102]: frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'),
.....: index=['Utah','Ohio','Texas','Oregon'])
In [103]: frame
Out[103]:
b d e
Utah 0.752307 -1.701850 0.231538
Ohio -1.156549 -0.116461 -0.224967
Texas -1.919040 0.287937 0.580048
Oregon -1.817225 1.545293 -0.606694
[4 rows x 3 columns]
In [104]: np.abs(frame)
Out[104]:
b d e
Utah 0.752307 1.701850 0.231538
Ohio 1.156549 0.116461 0.224967
Texas 1.919040 0.287937 0.580048
Oregon 1.817225 1.545293 0.606694
[4 rows x 3 columns]</span></span>
- DataFrame的apply 方法 将函数应用到由各列或行所形成的一维数组上
-
lambda只是一个表达式,函数体比def简单很多。
lambda的主体是一个表达式,而不是一个代码块。仅仅能在lambda表达式中封装有限的逻辑进去。
lambda表达式是起到一个函数速写的作用。允许在代码内嵌入一个函数的定义。
In [106]: f = lambda x:x.max() - x.min()
In [107]: frame.apply(f)
Out[107]:
b 2.671347
d 3.247143
e 1.186742
dtype: float64
In [108]: frame.apply(f,axis=1)
Out[108]:
Utah 2.454157
Ohio 1.040089
Texas 2.499088
Oregon 3.362518
dtype: float64
- 除了标量外,传递给apply的函数还可以返回由多个值组成的Series
In [120]: def f(x):
return pd.Series([x.min(),x.max()],index=['min','max'])
.....:
In [121]: frame.apply(f)
Out[121]:
b d e
min -1.919040 -1.701850 -0.606694
max 0.752307 1.545293 0.580048
[2 rows x 3 columns]
- 元素级的Python函数也是可以用的
- format 是 得到浮点值的格式化字符串
- 使用applymap函数
In [123]: format = lambda x:'%.2f' % x
In [124]: frame.applymap(format)
Out[124]:
b d e
Utah 0.75 -1.70 0.23
Ohio -1.16 -0.12 -0.22
Texas -1.92 0.29 0.58
Oregon -1.82 1.55 -0.61
[4 rows x 3 columns]
- Series有个应用与元素级别的函数的map方法
In [125]: frame['e'].map(format)
Out[125]:
Utah 0.23
Ohio -0.22
Texas 0.58
Oregon -0.61
Name: e, dtype: object
- 要对行或列索引进行排序(按字典顺序),可以使用sort_index方法,它将返回一个已排序的新对象
In [127]: obj = pd.Series(range(4),index = ['b','a','d','c'])
In [128]: obj.sort_index()
Out[128]:
a 1
b 0
c 3
d 2
dtype: int64
- 对于DataFrame则可以根据任意一个轴上的索引进行排序
In [132]: frame
Out[132]:
d a b c
three 0 1 2 3
one 4 5 6 7
[2 rows x 4 columns]
In [133]: frame.sort_index()
Out[133]:
d a b c
one 4 5 6 7
three 0 1 2 3
[2 rows x 4 columns]
In [134]: frame.sort_index(axis=1)
Out[134]:
a b c d
three 1 2 3 0
one 5 6 7 4
[2 rows x 4 columns]
- 数据默认是按照升序排序的,但也可以降序排序
In [135]: frame.sort_index(axis=1,ascending=False)
Out[135]:
d c b a
three 0 3 2 1
one 4 7 6 5
[2 rows x 4 columns]
- 如果按值对Series进行排序,可使用其order方法
In [137]: obj = pd.Series([4,7,-3,2])
In [138]: obj.order()
Out[138]:
2 -3
3 2
0 4
1 7
dtype: int64
- 在排序时,任何缺失值默认都会被放在Series的末尾
In [139]: obj = pd.Series([4,np.nan,7,np.nan,-3,2])
In [140]: obj.order()
Out[140]:
4 -3
5 2
0 4
2 7
1 NaN
3 NaN
dtype: float64
- 在DataFrame 上,你可能希望根据一个或多个列中的值进行排序
- 我们只要将一个或多个列中的名字传递给by选项即可
In [141]: frame = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
In [142]: frame
Out[142]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
[4 rows x 2 columns]
In [143]: frame.sort_index(by='b')
Out[143]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
[4 rows x 2 columns]
In [144]: frame.sort_index(by=['a','b'])
Out[144]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
[4 rows x 2 columns]
- 排名:与排序关系密切,它会增设一个排名值(从1开始,一直到数组中有效的数据);有时候它会根据某种规则破坏平级关系
- 默认情况下rank是通过‘为各组分配一个平均排名’的方式破坏平级关系
In [153]: obj = pd.Series([7,-5,7,4,2,0,4])
In [154]: obj.rank()
Out[154]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
- 要根据原数据中顺序给出排名
- 也可以按降序进行排名
- DataFrame 可以在行或列上计算排名(只需要声明axis即可),这里不再示范了
In [155]: obj.rank(method='first')
Out[155]:
0 6
1 1
2 7
3 4
4 3
5 2
6 5
dtype: float64
In [156]: obj.rank(ascending=False,method='max')
Out[156]:
0 2
1 7
2 2
3 4
4 5
5 6
6 4
dtype: float64
- 排名时用于破坏平级关系的method选项
(6)带有重复值的轴索引
- 许多pandas函数(如reindex)都要求标签唯一,但这并不是强制性
- 索引的is_unique属性可以告诉你它的值是否是唯一
In [157]: obj = pd.Series(range(5),index = ['a','a','b','b','c'])
In [158]: obj
Out[158]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [159]: obj.index.is_unique
Out[159]: False
- 对于带有重复值的索引,数据选取的行为将会有些不同
- 某个索引对应多个值,将会返回一个Series
- 而对应单个值,则返回一个标量值
In [161]: obj['a']
Out[161]:
a 0
a 1
dtype: int64
In [162]: obj['b']
Out[162]:
b 2
b 3
dtype: int64
- 对DataFrame 进行索引的时候也是这样
- 对于重复索引将返回一个DataFrame
In [163]: df = pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
In [164]: df
Out[164]:
0 1 2
a 0.788165 -0.046659 0.781164
a -1.329601 0.245623 0.074344
b 0.126223 0.141285 -2.280768
b 1.605712 -0.450426 -0.083235
[4 rows x 3 columns]
In [165]: df.ix['b']
Out[165]:
0 1 2
b 0.126223 0.141285 -2.280768
b 1.605712 -0.450426 -0.083235
[2 rows x 3 columns]
汇总和计算描述统计:
- pandas对象拥有一组常用的数学和统计方法,他们大部分都属于约简和汇总统计
- 他们都是基于没有缺失数据的假设而构建的
- na值会被自动排除,除非整个切片(整个行或者列)都是na ,通过skipna 选项可以禁用该功能
In [54]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],
[np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'],
....: columns=['one','two'])
In [55]: df
Out[55]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
[4 rows x 2 columns]
In [56]: df.sum()
Out[56]:
one 9.25
two -5.80
dtype: float64
In [57]: df.sum(1)
Out[57]:
a 1.40
b 2.60
c NaN
d -0.55
dtype: float64
In [58]: df.mean(axis=1,skipna=False)
Out[58]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
- 约简方法的选项
- 有些方法返回的是间接统计
- idxmin/idxmax
- 有些方法则是累计型的
In [64]: df.idxmax()
Out[64]:
one b
two d
dtype: object
In [65]: df.cumsum()
Out[65]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
[4 rows x 2 columns]
- describe 即不是约简型也不是累计型它用于一次性产生多个汇总统计
In [67]: df.describe()
Out[67]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
- 对于非数组型数据,describe会产生另外一种汇总统计
In [69]: obj = pd.Series(['a','a','b','c']*4)
In [70]: obj
Out[70]:
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
In [71]: obj.describe()
Out[71]:
count 16
unique 3
top a
freq 8
dtype: object
- 描述和汇总统计
(1)相关系数与协方差
- 读取来自Yahoo!Finance的股票价格和成交量
#!/usr/bin/env python
# coding=utf-8
import pandas as pd
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
all_data[ticker] = web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
price = pd.DataFrame({tic:data['Adj Close']
for tic,data in all_data.iteritems()})
volume = pd.DataFrame({tic:data['Volume']
for tic, data in all_data.iteritems()})
- Series的corr方法用于计算两个Series中重叠的,非NA的,按索引对齐的值的相关系数
- cov用于计算协方差
In [33]: returns = price.pct_change()
In [34]: returns.tail()
Out[34]:
AAPL GOOG IBM MSFT
Date
2009-12-24 0.034339 0.011117 0.004385 0.002587
2009-12-28 0.012294 0.007098 0.013326 0.005484
2009-12-29 -0.011861 -0.005571 -0.003477 0.007058
2009-12-30 0.012147 0.005376 0.005461 -0.013699
2009-12-31 -0.004300 -0.004416 -0.012597 -0.015504
[5 rows x 4 columns]
In [35]: returns.MSFT.corr(returns.IBM)
Out[35]: 0.49597969625135241
In [36]: returns.MSFT.cov(returns.IBM)
Out[36]: 0.00021595764700046635
- DataFrame的corr 和 cov 方法将以DataFrame的形式返回完整的相关系数或协方差矩阵
In [38]: returns.corr()
Out[38]:
AAPL GOOG IBM MSFT
AAPL 1.000000 0.470676 0.410011 0.424305
GOOG 0.470676 1.000000 0.390689 0.443587
IBM 0.410011 0.390689 1.000000 0.495980
MSFT 0.424305 0.443587 0.495980 1.000000
[4 rows x 4 columns]
In [39]: returns.cov()
Out[39]:
AAPL GOOG IBM MSFT
AAPL 0.001027 0.000303 0.000252 0.000309
GOOG 0.000303 0.000580 0.000142 0.000205
IBM 0.000252 0.000142 0.000367 0.000216
MSFT 0.000309 0.000205 0.000216 0.000516
[4 rows x 4 columns]
- 利用DataFrame的corrwidth方法,你可以计算其列或行跟另一个Series或DataFrame之间的相关系数
- 传入一个Series将会返回一个相关系数值Series(针对各列进行计算)
In [40]: returns.corrwith(returns.IBM)
Out[40]:
AAPL 0.410011
GOOG 0.390689
IBM 1.000000
MSFT 0.495980
dtype: float64
- 传入一个DataFrame则会计算按列名配对的相关系数,这里,我们计算百分比变换与成交量的相关系数
In [43]: returns.corrwith(volume)
Out[43]:
AAPL -0.057549
GOOG 0.062647
IBM -0.007892
MSFT -0.014245
dtype: float64
- 传入axis=1 即按行进行计算,无论何时,在计算相关系数之前,所有数据项都会按照标签进行对齐
(2)唯一值、值计数以及成员资格
- 这类方法,我们可以从一维Series的值中提取信息
- 第一个函数可以得到Series中的唯一值数组
In [44]: obj = pd.Series(['c','a','d','a','a','b','b','c','c'])
In [47]: uniques = obj.unique()
In [48]: uniques
Out[48]: array(['c', 'a', 'd', 'b'], dtype=object)
- 返回的唯一值是未经排序的,如果需要的话,可以对结果再次进行排序
- value_counts用于计算一个Series中各值出现的频率
In [50]: uniques.sort()
In [51]: uniques
Out[51]: array(['a', 'b', 'c', 'd'], dtype=object)
In [52]: obj.value_counts()
Out[52]:
c 3
a 3
b 2
d 1
dtype: int64
- value_counts是Series是按值频率降序排列的
- value_counts还是一个顶级pandas方法,可用于任何数组或序列
In [53]: pd.value_counts(obj.values,sort=False)
Out[53]:
a 3
c 3
b 2
d 1
dtype: int64
- isin函数用于判断矢量化集合的成员资格
- 就是看obj中的每个元素是否在我们传入函数中的那个集合中
In [55]: mask = obj.isin(['a','c'])
In [56]: mask
Out[56]:
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 True
dtype: bool
In [57]: obj[mask]
Out[57]:
0 c
1 a
3 a
4 a
7 c
8 c
dtype: object
- 唯一值、值计数、成员资格方法
- 将pandas.value_counts传给该DataFrame的apply函数
- 行索引就是该DataFrame中的所有唯一数据,该Dataframe的值就是出现的频率
In [58]: data = pd.DataFrame({'Qu1':[1,3,4,3,4],
....: 'Qu2':[2,3,1,2,3],
....: 'Qu3':[1,5,2,4,4]})
In [59]: data
Out[59]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
[5 rows x 3 columns]
In [60]: result = data.apply(pd.value_counts).fillna(0)
In [61]: result
Out[61]:
Qu1 Qu2 Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
[5 rows x 3 columns]
处理缺失数据:
- pandas使用浮点值NaN表示浮点和非浮点数组中的缺失数据,他只是一个便于被检测出来的标记而已
- python 内置的None 值也会被当做NA 处理
In [66]: string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado'])
In [67]: string_data
Out[67]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [69]: string_data.isnull()
Out[69]:
0 False
1 False
2 True
3 False
dtype: bool
In [70]: string_data[0] = None
In [71]: string_data.isnull()
Out[71]:
0 True
1 False
2 True
3 False
dtype: bool
- NA处理方法
(1)滤除缺失数据
- dropna返回一个仅含非空数据和索引值的Series
- 通过布尔型索引也可以达到这个目的
In [72]: from numpy import nan as NA
In [73]: data = pd.Series([1,NA,3.5,NA,7])
In [74]: data.dropna()
Out[74]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [75]: data[data.notnull()]
Out[75]:
0 1.0
2 3.5
4 7.0
dtype: float64
- 对于DataFrame 对象,事情就有点复杂了
- dropna默认丢弃任何含有缺失值的行
- 传入how=‘all' 将只丢弃全为NA的那些行
- 传入axis = 1 将删除全为NA的那些列
In [76]: data = pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],
....: [NA,NA,NA],[NA,6.5,3.]])
In [77]: cleaned = data.dropna()
In [78]: data
Out[78]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3
[4 rows x 3 columns]
In [79]: cleaned
Out[79]:
0 1 2
0 1 6.5 3
[1 rows x 3 columns]
In [80]: data.dropna(how='all')
Out[80]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
3 NaN 6.5 3
[3 rows x 3 columns]
In [81]: data.dropna(axis=1,how='all')
Out[81]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3
[4 rows x 3 columns]
- 当你只想留下某些数据时,可以利用thresh 参数来设定
In [90]: df.ix[:4,1] = NA;df.ix[:3,2] = NA
In [91]: df
Out[91]:
0 1 2
0 0.739899 NaN NaN
1 0.124309 NaN NaN
2 0.960898 NaN NaN
3 -0.059859 NaN NaN
4 -1.917175 NaN -1.122671
5 -0.073825 1.384257 -1.266332
6 0.957087 0.185031 -1.241783
[7 rows x 3 columns]
In [92]: df.dropna(thresh=4)
Out[92]:
Empty DataFrame
Columns: [0, 1, 2]
Index: []
[0 rows x 3 columns]
In [93]: df.dropna(thresh=3)
Out[93]:
0 1 2
5 -0.073825 1.384257 -1.266332
6 0.957087 0.185031 -1.241783
[2 rows x 3 columns]
(2)填充缺失数据
- fillna方法
In [97]: df.fillna(0)
Out[97]:
0 1 2
0 0.739899 0.000000 0.000000
1 0.124309 0.000000 0.000000
2 0.960898 0.000000 0.000000
3 -0.059859 0.000000 0.000000
4 -1.917175 0.000000 -1.122671
5 -0.073825 1.384257 -1.266332
6 0.957087 0.185031 -1.241783
[7 rows x 3 columns]
In [99]: df.fillna({1:0.5,2:-1})
Out[99]:
0 1 2
0 0.739899 0.500000 -1.000000
1 0.124309 0.500000 -1.000000
2 0.960898 0.500000 -1.000000
3 -0.059859 0.500000 -1.000000
4 -1.917175 0.500000 -1.122671
5 -0.073825 1.384257 -1.266332
6 0.957087 0.185031 -1.241783
[7 rows x 3 columns]
- fillna 默认会返回新对象,但也可以对现有对象进行就地修改
In [100]: _ = df.fillna(0,inplace=True)
In [101]: df
Out[101]:
0 1 2
0 0.739899 0.000000 0.000000
1 0.124309 0.000000 0.000000
2 0.960898 0.000000 0.000000
3 -0.059859 0.000000 0.000000
4 -1.917175 0.000000 -1.122671
5 -0.073825 1.384257 -1.266332
6 0.957087 0.185031 -1.241783
[7 rows x 3 columns]
- fillna函数的参数
层次化索引:
- 使用低维度形式处理高维度数据
- 创建一个Series,并用一个由列表或数组组成的列表作为索引
- 这就是带有MultiIndex索引的Series的格式化输出形式
- 索引之间的间隔表示直接使用上面的标签
- 对于一个层次化索引的对象,选取数据子集的操作很简单
In [105]: data = pd.Series(np.random.randn(10),
.....: index = [['a','a','a','b','b','b','c','c','d','d'],
.....: [1,2,3,1,2,3,1,2,2,3]])
In [106]: data
Out[106]:
a 1 -0.237206
2 -0.992311
3 1.685961
b 1 0.987261
2 -1.166006
3 -0.962065
c 1 -1.071484
2 0.393728
d 2 0.793652
3 -0.223266
dtype: float64
In [107]: data.index
Out[107]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
In [108]: data['b']
Out[108]:
1 0.987261
2 -1.166006
3 -0.962065
dtype: float64
- 可以在“内层”中进行选取
In [109]: data[:,2]
Out[109]:
a -0.992311
b -1.166006
c 0.393728
d 0.793652
dtype: float64
- 层次化索引在数据重塑和基于分组的操作中扮演着重要的角色
- 可以利用unstack 方法被重新安排到一个DataFrame中
In [110]: data.unstack()
Out[110]:
1 2 3
a -0.237206 -0.992311 1.685961
b 0.987261 -1.166006 -0.962065
c -1.071484 0.393728 NaN
d NaN 0.793652 -0.223266
[4 rows x 3 columns]
In [111]: data.unstack().stack()
Out[111]:
a 1 -0.237206
2 -0.992311
3 1.685961
b 1 0.987261
2 -1.166006
3 -0.962065
c 1 -1.071484
2 0.393728
d 2 0.793652
3 -0.223266
dtype: float64
- 对于一个DataFrame,每条轴都可以有分层索引
- 每层都可以有名字;如果制定了名称,他们就会显示在控制台输出中
- 不要将索引名跟轴标签混为一谈
In [113]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'],[1,2,1,2]],
columns = [['Ohio',"Ohio",'Colorado'],
['Green','Red','Green']])
In [114]: frame
Out[114]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
In [115]: frame.index.names = ['key1','key2']
In [116]: frame.columns.names = ['state','color']
In [117]: frame
Out[117]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
- 可以单独创建MultiIndex 然后复用,上面那个DataFrame中的(分级)列可以这样创建
In [125]: col = pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])
In [126]: frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'],[1,2,1,2]],
columns = col )
In [127]: frame2
Out[127]:
state Ohio Colorado
color Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
(1)重排分级顺序
- swaplevel 接受两个级别编号或名称,并返回一个互换了级别的新对象
In [129]: frame.swaplevel('key1','key2')
Out[129]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
[4 rows x 3 columns]
- sortlevel 则根据单个级别中的值对数据进行排序
- 交换级别时,常常也会用到sortlevel
In [130]: frame.sortlevel(1)
Out[130]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
[4 rows x 3 columns]
In [131]: frame
Out[131]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
In [132]: frame.sortlevel(0)
Out[132]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
In [134]: frame.swaplevel(0,1)
Out[134]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
[4 rows x 3 columns]
In [135]: frame
Out[135]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
[4 rows x 3 columns]
In [136]: frame.swaplevel(0,1).sortlevel(0)
Out[136]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
[4 rows x 3 columns]
(2)根据级别汇总统计
- 我们可以根据行或列上的级别来进行求和
In [137]: frame.sum(level='key2')
Out[137]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
[2 rows x 3 columns]
In [138]: frame.sum(level='color',axis=1)
Out[138]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
[4 rows x 2 columns]
(3)使用DataFrame的列
- DataFrame 的 set_index 函数会将其一个或多个列转换为行索引,并创建一个新的DataFrame
In [139]: frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
.....: 'c':['one','one','one','two','two','two','two'],
.....: 'd':[0,1,2,0,1,2,3]})
In [140]: frame
Out[140]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
[7 rows x 4 columns]
In [141]: frame2 = frame.set_index(['c','d'])
In [142]: frame2
Out[142]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
[7 rows x 2 columns]
- 默认情况下,那些列会从DataFrame中移除,但也可以将其保留下来
- reset_index 的功能跟set_index刚好相反,层次化索引的级别会被转移到列里面
In [143]: frame.set_index(['c','d'],drop=False)
Out[143]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
[7 rows x 4 columns]
In [144]: frame2.reset_index()
Out[144]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
[7 rows x 4 columns]