Chapter 5: Getting Started with pandas

1. Introduction to pandas Data Structures

Series

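A Series is a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index. If no index is specified, a default integer index running from 0 to N-1 is created. The sessions below assume that "from pandas import Series, DataFrame", "import pandas as pd", and "import numpy as np" have already been run: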

In [4]: obj = Series([4, 7, -5, 3])

In [5]: obj
Out[5]: 
0    4
1    7
2   -5
3    3
dtype: int64



We can specify the index ourselves and then retrieve values by index label:


In [11]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [12]: obj2.values
Out[12]: array([ 4,  7, -5,  3])

In [13]: obj2.index
Out[13]: Index([u'd', u'b', u'a', u'c'], dtype='object')

In [14]: obj2['a']
Out[14]: -5

In [15]: obj2[['a']]
Out[15]: 
a   -5
dtype: int64

In [16]: obj2[['b', 'a']]
Out[16]: 
b    7
a   -5
dtype: int64



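NumPy-style operations, such as filtering with a boolean array or multiplying by a scalar, preserve the link between index and value: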


In [17]: obj2[obj2 > 0]
Out[17]: 
d    4
b    7
c    3
dtype: int64

In [18]: obj2 * 2
Out[18]: 
d     8
b    14
a   -10
c     6
dtype: int64
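
As a minimal sketch of the point above, the following standalone script repeats the two operations and checks that the labels travel with the values. It assumes a recent pandas where the class is reached as pd.Series; the names positive and doubled are chosen here for illustration only.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Boolean filtering keeps the labels of the rows that pass the test.
positive = obj2[obj2 > 0]

# Scalar arithmetic is applied element-wise and keeps every label.
doubled = obj2 * 2

print(positive)
print(doubled)
```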



A Series can also be thought of as a fixed-length, ordered dict, since it is a mapping from index values to data values:


In [19]: 'b' in obj2
Out[19]: True

In [20]: 'e' in obj2
Out[20]: False



If the data is stored in a Python dict, a Series can be created directly from that dict:


In [21]: sdata = {'Ohio' : 35000, 'Texas' : 71000, 'Oregon' : 16000, 'Utah' : 5000}

In [22]: obj3 = Series(sdata)

In [23]: obj3
Out[23]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64



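When only a dict is passed, the index of the resulting Series consists of the dict's keys. If an index is passed as well, values are matched to it by key: labels with no matching key (California here) get NaN, and keys missing from the index (Utah) are left out: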


In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [25]: obj4 = Series(sdata, index=states)

In [26]: obj4
Out[26]: 
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
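
The matching behavior can be checked with a short script. This is only a sketch assuming a recent pandas; it rebuilds the same sdata and states values and confirms which labels end up missing.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Values are looked up by key; the result follows the order of `states`.
obj4 = pd.Series(sdata, index=states)

print(obj4)
print(pd.isna(obj4["California"]))  # no 'California' key in sdata
print("Utah" in obj4)               # 'Utah' was not requested in `states`
```

Because NaN is a floating-point value, the integer data is converted to float64 in the result.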



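Notes:


1. In a Series, a value is looked up through its index label.

2. NA (displayed as NaN) stands for missing data.

3. Unlike a plain dict, a Series keeps its entries in a defined order.

The pandas isnull and notnull functions can be used to detect missing data: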

In [27]: pd.isnull(obj4)
Out[27]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: pd.notnull(obj4)
Out[28]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool



The most important feature of a Series is that it automatically aligns differently indexed data in arithmetic operations:


In [29]: obj3
Out[29]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [30]: obj4
Out[30]: 
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

In [31]: obj3 + obj4
Out[31]: 
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
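
A small sketch of the alignment rule, assuming a recent pandas: labels present in only one operand, or paired with NaN, come out as NaN in the sum. The variable total is a name chosen here for illustration.

```python
import pandas as pd

obj3 = pd.Series({"Ohio": 35000, "Oregon": 16000, "Texas": 71000, "Utah": 5000})
obj4 = pd.Series(
    {"California": float("nan"), "Ohio": 35000.0, "Oregon": 16000.0, "Texas": 71000.0}
)

# Addition pairs values by label, not by position.
total = obj3 + obj4

print(total)
print(total.isna().sum())  # California (NaN input) and Utah (no partner)
```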



Both the Series object itself and its index have a name attribute, which integrates closely with other key areas of pandas functionality:


In [32]: obj4.name = 'population'

In [33]: obj4.index.name = 'state'

In [34]: obj4
Out[34]: 
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64



The index of a Series can be modified in place by assignment:


In [36]: obj
Out[36]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [37]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [38]: obj
Out[38]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64



DataFrame


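A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.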

There are many ways to construct a DataFrame. One of the most common is to pass in a dict of equal-length lists or NumPy arrays:

In [41]: data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
        'year' : [2000, 2001, 2002, 2001, 2002],
        'pop' : [1.5, 3.7, 3.6, 2.4, 2.9]}

In [42]: frame = DataFrame(data)

In [43]: frame
Out[43]: 
   pop   state  year
0  1.5    Ohio  2000
1  3.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002



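The resulting DataFrame has an index assigned automatically, and in the pandas version used here the columns come out in sorted order. If a sequence of columns is specified, the DataFrame's columns are arranged in exactly that order, and any requested column that is not found in the data is filled with NA values: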


In [44]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[44]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  3.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

In [46]: frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                        index=['one','two','three','four','five'])

In [47]: frame2
Out[47]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  3.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
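
The constructor behavior described above can be reproduced as follows. This is a sketch assuming a recent pandas, with the same data as the session; it checks the column order and the all-NaN debt column.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 3.7, 3.6, 2.4, 2.9],
}

# `columns` fixes the column order; 'debt' is not in `data`, so it is all NaN.
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three", "four", "five"],
)

print(frame2)
```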



A column of a DataFrame can be retrieved as a Series either by dict-like notation or by attribute access:


In [48]: frame2['state']
Out[48]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [49]: frame2.year
Out[49]: 
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64



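Note that the returned Series has the same index as the original DataFrame, and its name attribute has been set accordingly. Rows can also be retrieved by position or by name, for example with the ix indexing field (ix was removed in pandas 1.0, where loc and iloc take its place):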


In [50]: frame2.ix['three']
Out[50]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
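
For readers on a pandas release without ix, the same row lookup can be sketched with loc (by label) and iloc (by position). The frame below is a cut-down stand-in for frame2, and the variable names are illustrative.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {
        "year": [2000, 2001, 2002, 2001, 2002],
        "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "pop": [1.5, 3.7, 3.6, 2.4, 2.9],
    },
    index=["one", "two", "three", "four", "five"],
)

row_by_label = frame2.loc["three"]  # select the row labeled 'three'
row_by_pos = frame2.iloc[2]         # select the third row by position

print(row_by_label)
```

Both calls return the row as a Series whose index is the DataFrame's columns and whose name is the row label.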



Columns can be modified by assignment. When a list or array is assigned to a column, however, its length must match the length of the DataFrame:


In [56]: frame2['debt'] = 16.5

In [57]: frame2
Out[57]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  3.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

In [58]: frame2['debt'] = np.arange(5.)

In [59]: frame2
Out[59]: 
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  3.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4



If we instead write:


In [60]: frame2['debt'] = np.arange(6.)



This raises an exception, because the array has six elements while the DataFrame has only five rows.
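
A minimal sketch of the failure, assuming a recent pandas, where the mismatch surfaces as a ValueError. The one-column frame here is a hypothetical stand-in for frame2.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 3.7, 3.6, 2.4, 2.9]})

try:
    # Six values cannot be assigned to a five-row DataFrame.
    frame["debt"] = np.arange(6.0)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```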


If a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values:

In [62]: val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])

In [63]: frame2['debt'] = val

In [64]: frame2
Out[64]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  3.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
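
The alignment on assignment can be verified with a short sketch (recent pandas assumed; frame is a reduced stand-in for frame2):

```python
import pandas as pd

frame = pd.DataFrame(
    {"pop": [1.5, 3.7, 3.6, 2.4, 2.9]},
    index=["one", "two", "three", "four", "five"],
)
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])

# Values are placed by label; rows without a match ('one', 'three') get NaN.
frame["debt"] = val

print(frame)
```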



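Assigning to a column that does not exist creates a new column, and the del keyword deletes a column. In the session below, a boolean column named eastern had already been added to frame2 before In [68]; judging from the output, it is True for the Ohio rows: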


In [68]: frame2
Out[68]: 
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  3.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

In [69]: del frame2['eastern']

In [70]: frame2
Out[70]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  3.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
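
Since the session does not show the step that created the eastern column, here is one plausible way to do it, as a sketch only: the comparison against 'Ohio' is an assumption inferred from the output above, not taken from the original session.

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"]},
    index=["one", "two", "three", "four", "five"],
)

# Assigning to a new name creates the column (assumed definition of 'eastern').
frame["eastern"] = frame["state"] == "Ohio"
created = list(frame.columns)

# del removes the column again, in place.
del frame["eastern"]
remaining = list(frame.columns)

print(created, remaining)
```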



Another common form of data is a nested dict of dicts:


In [71]: pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
   ....: 'Ohio' : {2000 : 1.5, 2001 : 1.7, 2002 : 3.6}}

In [72]: frame3 = DataFrame(pop)

In [74]: frame3
Out[74]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6



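The keys of the inner dicts are combined and sorted to form the index of the result. This does not happen if an index is specified explicitly: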


In [75]: DataFrame(pop, index=[2001, 2002, 2003])
Out[75]: 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
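
The nested-dict case can be sketched as below (recent pandas assumed). The outer keys become columns and the inner keys become row labels. The order of the unioned row labels may differ between pandas versions, so the check compares them as a set.

```python
import pandas as pd

pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

frame3 = pd.DataFrame(pop)                            # inner keys are unioned
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])  # index used as given

print(frame3)
print(explicit)
```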



A dict of Series is treated in much the same way:


In [77]: pdata = {'Ohio' : frame3['Ohio'][:-1],
   ....:        'Nevada' : frame3['Nevada'][:2]}

In [78]: pdata
Out[78]: 
{'Nevada': 2000    NaN
 2001    2.4
 Name: Nevada, dtype: float64, 'Ohio': 2000    1.5
 2001    1.7
 Name: Ohio, dtype: float64}

In [79]: DataFrame(pdata)
Out[79]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7



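Possible data inputs to the DataFrame constructor


Type                              Description
2D ndarray                        A matrix of data; row and column labels can optionally be passed as well
dict of arrays, lists, or tuples  Each sequence becomes a column of the DataFrame; all sequences must have the same length
NumPy structured/record array     Treated like the "dict of arrays" case
dict of Series                    Each Series becomes a column; if no index is passed explicitly, the indexes of the Series are unioned to form the row index of the result
dict of dicts                     Each inner dict becomes a column; the keys are unioned to form the row index, as in the "dict of Series" case
list of dicts or Series           Each item becomes a row of the DataFrame; the union of the dict keys or Series indexes becomes the column labels
list of lists or tuples           Treated like the "2D ndarray" case
another DataFrame                 The DataFrame's indexes are used unless different ones are passed explicitly
NumPy MaskedArray                 Like the "2D ndarray" case, except that masked values become NA (missing) in the resulting DataFrame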

If the index and columns of a DataFrame have their name attributes set, these are displayed as well:

In [80]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [81]: frame3
Out[81]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6



As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:


In [82]: frame3.values
Out[82]: 
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [83]: frame3
Out[83]: 
state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6



If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns:


In [84]: frame2.values
Out[84]: 
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 3.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

In [85]: frame2
Out[85]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  3.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7



Index Objects

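pandas's Index objects are responsible for holding the axis labels and other metadata. Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index: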


In [86]: obj = Series(range(3), index=['a', 'b', 'c'])

In [87]: index = obj.index

In [88]: index
Out[88]: Index([u'a', u'b', u'c'], dtype='object')



Index objects are immutable. It is precisely this immutability that makes it safe to share an Index object among multiple data structures:


In [89]: index = pd.Index(np.arange(3))

In [90]: obj2 = Series([1.5, -2.5, 0], index = index)

In [91]: obj2.index is index
Out[91]: True
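
Immutability is easy to confirm with a sketch like the following (recent pandas assumed), where an attempt to overwrite one label raises a TypeError and leaves the Index unchanged:

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"  # item assignment is not supported on an Index
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(list(index))
```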



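Index methods and properties


Method        Description
append        Concatenate with another Index object, producing a new Index
diff          Compute the set difference as an Index (named difference in newer pandas versions)
intersection  Compute the set intersection
union         Compute the set union
isin          Compute a boolean array indicating whether each value is contained in the passed collection
delete        Compute a new Index with the element at position i deleted
drop          Compute a new Index with the passed values deleted
insert        Compute a new Index with an element inserted at position i
is_monotonic  Return True if each element is greater than or equal to the previous element
is_unique     Return True if the Index has no duplicate values
unique        Compute the array of unique values in the Index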
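
A few of these methods in action, as a sketch assuming a recent pandas, where the set difference is spelled difference rather than diff. The names left and right are illustrative. Every call returns a new object and leaves the original Index untouched.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

union = left.union(right)                # labels found in either Index
common = left.intersection(right)        # labels found in both
only_left = left.difference(right)       # labels found only in `left`
mask = left.isin(["a", "c"])             # boolean NumPy array
dropped = left.drop("b")                 # new Index without 'b'
inserted = left.insert(1, "z")           # new Index with 'z' at position 1

print(list(union), list(common), list(only_left))
print(mask.tolist(), list(dropped), list(inserted))
```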


2. Essential Functionality

Reindexing

An important method on pandas objects is reindex, which creates a new object with the data conformed to a new index:

In [4]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [5]: obj
Out[5]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64



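Calling reindex on this Series rearranges the data according to the new index. If an index value was not already present, a missing value is introduced, unless a fill_value is supplied: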


In [6]: obj2 = obj.reindex(['a','b','c','d','e'])

In [7]: obj2
Out[7]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [8]: obj2 = obj.reindex(['a','b','c','d','e'], fill_value=0)

In [9]: obj2
Out[9]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64



With method='ffill', values can be forward-filled while reindexing:


In [10]: obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [11]: obj3
Out[11]:
0      blue
2    purple
4    yellow
dtype: object

In [12]: obj3.reindex(range(6), method='ffill')
Out[12]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
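
To contrast the filling options, the sketch below (recent pandas assumed) reindexes the same Series three ways: with no filling, with a constant fill_value, and with forward filling. The value "none" is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

plain = obj3.reindex(range(6))                       # gaps become missing
constant = obj3.reindex(range(6), fill_value="none")  # gaps get a constant
filled = obj3.reindex(range(6), method="ffill")      # gaps copy the previous value

print(filled)
```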



reindex method options


Argument           Description
ffill or pad       Fill (or carry) values forward
bfill or backfill  Fill (or carry) values backward

For a DataFrame, reindex can alter the (row) index, the columns, or both. When only a sequence is passed, the rows are reindexed:

In [15]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a','c','d'],columns=['Ohio','Texas','California'])

In [16]: frame
Out[16]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [17]: frame2 = frame.reindex(['a','b','c','d'])

In [18]: frame2
Out[18]:
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8



The columns can be reindexed with the columns keyword:


In [19]: states = ['Texas', 'Utah', 'California']

In [20]: frame.reindex(columns=states)
Out[20]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8



Rows and columns can also be reindexed in one call, though interpolation is only applied row-wise (along axis 0):


In [21]: frame.reindex(index=['a','b','c','d'], method='ffill',columns=states)
Out[21]:
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
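
Newer pandas versions may reject method='ffill' when the columns being reindexed are not sorted, so a cautious sketch splits the work into two calls: forward-fill along the rows first, then reindex the columns. The two-step split is a choice made here, not something the original session does.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)
states = ["Texas", "Utah", "California"]

# Step 1: add row 'b', copying values forward from row 'a'.
rows_filled = frame.reindex(index=["a", "b", "c", "d"], method="ffill")

# Step 2: pick and reorder columns; 'Utah' does not exist, so it is all NaN.
result = rows_filled.reindex(columns=states)

print(result)
```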



Reindexing can be written more concisely with the label-indexing capability of ix:


In [23]: frame.ix[['a','b','c','d'], states]
Out[23]:
   Texas  Utah  California
a      1   NaN           2
b    NaN   NaN         NaN
c      4   NaN           5
d      7   NaN           8



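reindex function arguments

Argument    Description
index       New sequence to use as the index
method      Interpolation (fill) method
fill_value  Substitute value to use when reindexing introduces missing data
limit       Maximum size of the gap to fill when forward- or backward-filling
level       Match a simple Index on the given level of a MultiIndex; otherwise select a subset of it
copy        Defaults to True, which always copies the data; if False, the data is not copied when the new index is equivalent to the old one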

Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values deleted from an axis:


In [24]: obj = Series(np.arange(5.), index=['a','b','c','d','e'])

In [25]: new_obj = obj.drop('c')

In [26]: new_obj
Out[26]:
a    0
b    1
d    3
e    4
dtype: float64

In [27]: obj.drop(['d','c'])
Out[27]:
a    0
b    1
e    4
dtype: float64



With a DataFrame, index values can be deleted from either axis:


In [28]: data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])

In [29]: data.drop(['Colorado','Ohio'])
Out[29]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [30]: data.drop('two', axis=1)
Out[30]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [32]: data.drop(['two','four'], axis=1)
Out[32]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
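
One detail worth checking is that drop does not modify the object it is called on. A short sketch (recent pandas assumed; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = data.drop(["Colorado", "Ohio"])      # axis 0 (rows) is the default
cols_dropped = data.drop(["two", "four"], axis=1)   # axis=1 drops columns

print(rows_dropped)
print(cols_dropped)
print(data.shape)  # the original is unchanged
```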



Indexing, Selection, and Filtering


Series indexing is not limited to integers; index labels can be used as well:

In [33]: obj = Series(np.arange(4.), index=['a','b','c','d'])

In [34]: obj['b']
Out[34]: 1.0

In [35]: obj[1]
Out[35]: 1.0

In [36]: obj[2:4]
Out[36]:
c    2
d    3
dtype: float64

In [37]: obj[['b','a','d']]
Out[37]:
b    1
a    0
d    3
dtype: float64

In [38]: obj[[1,3]]
Out[38]:
b    1
d    3
dtype: float64

In [39]: obj[obj<2]
Out[39]:
a    0
b    1
dtype: float64



Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive:


In [40]: obj['b':'c']
Out[40]:
b    1
c    2
dtype: float64
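
The difference between the two kinds of slice shows up when the same endpoints are written by label and by position. This sketch uses the explicit loc and iloc accessors of recent pandas to keep the two cases apart.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]  # label slice: both 'b' and 'c' are included
by_pos = obj.iloc[1:2]       # positional slice: stops before position 2

print(by_label)
print(by_pos)
```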



Indexing into a DataFrame retrieves one or more columns:


In [41]: data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])

In [42]: data
Out[42]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [43]: data['two']
Out[43]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [44]: data[['three','one']]
Out[44]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12



This kind of indexing has a few special cases. The first is selecting rows by slicing or with a boolean array:


In [45]: data[:2]
Out[45]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [46]: data[data['three'] > 5]
Out[46]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15



Another use is indexing with a boolean DataFrame:


In [47]: data < 5
Out[47]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [48]: data[data < 5] = 0

In [49]: data
Out[49]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
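
The masking step can be replayed as a standalone sketch (recent pandas assumed). Only the cells where the mask is True are overwritten; everything else keeps its value.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

mask = data < 5   # boolean DataFrame with the same shape as `data`
data[mask] = 0    # assign only where the mask is True

print(data)
```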



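For label-based indexing on the rows of a DataFrame, there is the indexing field ix. It selects a subset of the rows and columns from a DataFrame using NumPy-like notation together with axis labels: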


In [50]: data.ix['Colorado', ['two', 'three']]
Out[50]:
two      5
three    6
Name: Colorado, dtype: int64

In [51]: data.ix[['Colorado','Utah'],[3,0,1]]
Out[51]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [52]: data.ix[2]
Out[52]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [53]: data.ix[:'Utah', 'two']
Out[53]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [54]: data.ix[data.three>5,:3]
Out[54]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
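
On pandas releases that no longer provide ix, the selections above can be approximated with loc for labels and iloc for integer positions. The sketch below starts from a fresh copy of the 4x4 frame, so its values differ slightly from the session, where cells below 5 had already been set to 0.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

a = data.loc["Colorado", ["two", "three"]]        # one row, two columns, by label
b = data.iloc[[1, 2], [3, 0, 1]]                  # rows and columns by position
c = data.loc[:"Utah", "two"]                      # label slice, 'Utah' included
d = data.loc[data["three"] > 5].iloc[:, :3]       # boolean rows, first 3 columns

print(a, b, c, d, sep="\n\n")
```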



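Indexing options with DataFrame


Type                          Description
obj[val]                      Select a single column or a sequence of columns from the DataFrame. Special-case conveniences: boolean array (filter rows), slice (slice rows), boolean DataFrame (set values based on a condition)
obj.ix[val]                   Select a single row or a subset of rows from the DataFrame
obj.ix[:, val]                Select a single column or a subset of columns
obj.ix[val1, val2]            Select both rows and columns
reindex method                Conform one or more axes to new indexes
xs method                     Select a single row or column by label, returned as a Series
icol, irow methods            Select a single column or row by integer position, returned as a Series
get_value, set_value methods  Select a single value by row label and column label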

Arithmetic and Data Alignment


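pandas can perform arithmetic between objects with different indexes. When objects are added together, if any index pairs are not the same, the index of the result is the union of the index pairs: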

In [55]: s1 = Series([7.3, -2.5, 3.4, 1.5],index=['a','c','d','e'])

In [56]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])

In [57]: s1
Out[57]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [58]: s2
Out[58]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [59]: s1 + s2
Out[59]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64



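Automatic data alignment introduces NA values at the labels that do not overlap, and missing values then propagate through arithmetic computations. For a DataFrame, alignment is performed on both the rows and the columns: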


In [60]: df1 = DataFrame(np.arange(9.).reshape((3,3)), columns=list('bcd'),index=['Ohio','Texas','Colorado'])

In [61]: df2 = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [62]: df1
Out[62]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

In [63]: df2
Out[63]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [64]: df1 + df2
Out[64]:
           b   c   d   e
Colorado NaN NaN NaN NaN
Ohio       3 NaN   6 NaN
Oregon   NaN NaN NaN NaN
Texas      9 NaN  12 NaN
Utah     NaN NaN NaN NaN
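
The two-axis alignment can be checked cell by cell with a sketch like this one (recent pandas assumed). Only cells whose row label and column label exist in both frames hold a number.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(9.0).reshape((3, 3)),
    columns=list("bcd"),
    index=["Ohio", "Texas", "Colorado"],
)
df2 = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

total = df1 + df2  # rows and columns are both unioned

print(total)
print(int(total.notna().sum().sum()))  # number of cells with a real value
```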



Arithmetic Methods with Fill Values


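We can use the methods (add, sub, div, mul) instead of the operators (+, -, /, *) and pass an argument that supplies a fill value for labels found in only one of the objects: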


In [65]: df1 = DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))

In [66]: df2 = DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))

In [67]: df1.add(df2, fill_value=0)
Out[67]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
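
Comparing the operator with the method makes the effect of fill_value visible. In this sketch (recent pandas assumed), the plain sum leaves NaN wherever one side is missing, while add with fill_value=0 treats the missing side as zero.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

plain = df1 + df2                    # NaN in column 'e' and in row 3
filled = df1.add(df2, fill_value=0)  # missing cells of df1 count as 0

print(plain)
print(filled)
```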



Operations between DataFrame and Series


First, an example of broadcasting with NumPy arrays:


In [68]: arr = np.arange(12.).reshape((3,4))

In [69]: arr
Out[69]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [70]: arr[0]
Out[70]: array([ 0.,  1.,  2.,  3.])

In [71]: arr - arr[0]
Out[71]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])



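Operations between a DataFrame and a Series work in much the same way. By default, the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows: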



In [73]: frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [74]: series = frame.ix[0]

In [75]: frame
Out[75]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [76]: series
Out[76]:
b    0
d    1
e    2
Name: Utah, dtype: float64

In [77]: frame - series
Out[77]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9



If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union:



In [78]: series2 = Series(range(3), index=['b','e','f'])

In [79]: frame + series2
Out[79]:
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN



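If you want to match on the rows and broadcast across the columns instead, you have to use one of the arithmetic methods and pass the axis to match on: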



In [80]: series3 = frame['d']

In [81]: frame
Out[81]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [82]: series3
Out[82]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [83]: frame.sub(series3, axis=0)
Out[83]:
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1
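
Both broadcasting directions side by side, as a sketch assuming a recent pandas (iloc replaces ix for taking the first row). The names by_row and by_col are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# Default: match the Series index on the columns, broadcast down the rows.
by_row = frame - frame.iloc[0]

# axis=0: match the Series index on the row labels, broadcast across columns.
by_col = frame.sub(frame["d"], axis=0)

print(by_row)
print(by_col)
```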



Function Application and Mapping


NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [84]: frame = DataFrame(np.random.randn(4,3), columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])

In [85]: frame
Out[85]:
               b         d         e
Utah    1.998126  0.446430  0.740153
Ohio    1.300625 -1.165500 -1.654962
Texas  -0.715259  0.642894 -1.302104
Oregon  0.110120  0.131228  0.384717

In [86]: np.abs(frame)
Out[86]:
               b         d         e
Utah    1.998126  0.446430  0.740153
Ohio    1.300625  1.165500  1.654962
Texas   0.715259  0.642894  1.302104
Oregon  0.110120  0.131228  0.384717



Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row, which is done with apply:


In [87]: f = lambda x: x.max() - x.min()

In [88]: frame.apply(f)
Out[88]:
b    2.713386
d    1.808394
e    2.395114
dtype: float64

In [89]: frame.apply(f, axis=1)
Out[89]:
Utah      1.551697
Ohio      2.955586
Texas     1.944998
Oregon    0.274597
dtype: float64



Besides scalar values, the function passed to apply can also return a Series containing multiple values:


In [90]: def f(x):
   ....:     return Series([x.min(), x.max()], index=['min','max'])
   ....:

In [91]: frame.apply(f)
Out[91]:
            b         d         e
min -0.715259 -1.165500 -1.654962
max  1.998126  0.642894  0.740153
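
Because the session uses random numbers, its results cannot be reproduced exactly. The sketch below uses a small fixed frame (values invented for illustration) so that each result of apply can be checked by hand.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.0, -2.0, 3.0], [4.0, 0.0, -1.0]],
    columns=list("bde"),
    index=["Utah", "Ohio"],
)

def spread(x):
    """Range of a one-dimensional slice: max minus min."""
    return x.max() - x.min()

def min_max(x):
    """Return two summary values per slice as a labeled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

per_column = frame.apply(spread)       # one result per column
per_row = frame.apply(spread, axis=1)  # one result per row
summary = frame.apply(min_max)         # rows 'min' and 'max', one column each

print(per_column, per_row, summary, sep="\n\n")
```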



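Element-wise Python functions can be used too: applymap applies a function to every element of a DataFrame, and map does the same for a Series: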


In [96]: format = lambda x: '%.2f' % x

In [97]: frame.applymap(format)
Out[97]:
            b      d      e
Utah     2.00   0.45   0.74
Ohio     1.30  -1.17  -1.65
Texas   -0.72   0.64  -1.30
Oregon   0.11   0.13   0.38

In [98]: frame['e'].map(format)
Out[98]:
Utah       0.74
Ohio      -1.65
Texas     -1.30
Oregon     0.38
Name: e, dtype: object
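
The name of the DataFrame-wide element-wise method has changed across pandas releases, so a version-neutral sketch can apply Series.map to each column instead. The data and the name fmt are invented for illustration; fmt also avoids shadowing Python's built-in format.

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.998, -0.7153], "d": [-0.446, 0.1312]})

def fmt(x):
    """Format one number with two decimal places."""
    return "%.2f" % x

one_column = frame["d"].map(fmt)                    # element-wise on a Series
whole_frame = frame.apply(lambda col: col.map(fmt))  # column by column

print(one_column)
print(whole_frame)
```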



Sorting and Ranking


Sorting by index is usually done with the sort_index method:


In [99]: obj = Series(range(4), index=['d','a','b','c'])

In [100]: obj.sort_index()
Out[100]:
a    1
b    2
c    3
d    0
dtype: int64



With a DataFrame, you can sort by the index on either axis:



In [101]: frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'],columns=['d','a','b','c'])

In [102]: frame.sort_index()
Out[102]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [103]: frame.sort_index(axis=1)
Out[103]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4



Data is sorted in ascending order by default, but it can be sorted in descending order as well:



In [104]: frame.sort_index(axis=1,ascending=False)
Out[104]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5



To sort a Series by its values, use its order method (replaced by sort_values in newer pandas versions):



In [105]: obj=Series([4,7,-3,2])

In [106]: obj.order()
Out[106]:
2   -3
3    2
0    4
1    7
dtype: int64



Any missing values are placed at the end of the Series by default when sorting:



In [107]: obj = Series([4, np.nan, 7, np.nan, -3, 2])

In [108]: obj.order()
Out[108]:
4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64


On a DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option:

In [109]: frame = DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

In [110]: frame
Out[110]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [111]: frame.sort_index(by='b')
Out[111]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

In [112]: frame.sort_index(by=['a','b'])
Out[112]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
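
In newer pandas versions, sorting by value is done with sort_values on both Series and DataFrame. The following sketch repeats the sorts above with that method and checks the resulting row order.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

by_b = frame.sort_values(by="b")           # sort rows by column 'b'
by_a_b = frame.sort_values(by=["a", "b"])  # sort by 'a', break ties with 'b'

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_obj = obj.sort_values()             # NaN values go to the end

print(by_b, by_a_b, sorted_obj, sep="\n\n")
```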



Note: ranking is not covered here, as I could not make sense of it.
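
For what it is worth, here is a small hedged sketch of what rank does, using made-up data. Each value receives its position in the sorted order, starting at 1. By default tied values share the average of their positions, while method='first' breaks ties by order of appearance.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Sorted order is -5, 0, 2, 4, 4, 7, 7 (positions 1 to 7).
# The two 4s occupy positions 4 and 5, so each gets 4.5;
# the two 7s occupy positions 6 and 7, so each gets 6.5.
average_ranks = obj.rank()

# With method='first', the earlier of two tied values gets the lower rank.
first_ranks = obj.rank(method="first")

print(average_ranks.tolist())
print(first_ranks.tolist())
```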


Axis Indexes with Duplicate Values

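Index labels do not have to be unique. Indexing a label that maps to multiple entries returns a Series, while a label that maps to a single entry returns a scalar value: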

In [113]: obj = Series(range(5), index=['a','a','b','b','c'])

In [114]: obj.index.is_unique
Out[114]: False

In [115]: obj['a']
Out[115]:
a    0
a    1
dtype: int64

In [116]: obj['c']
Out[116]: 4



The same applies to indexing a DataFrame:


In [117]: df = DataFrame(np.random.randn(4,3),index=['a','a','b','b'])

In [118]: df
Out[118]:
          0         1         2
a  1.220948  0.747138 -0.993805
a  0.556358 -0.822418  0.077788
b  2.629853  0.110979 -0.308346
b  0.509241 -0.843611 -0.879884

In [119]: df.ix['b']
Out[119]:
          0         1         2
b  2.629853  0.110979 -0.308346
b  0.509241 -0.843611 -0.879884
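
Since the return type depends on whether a label is duplicated, code that indexes such data may need to handle both cases. A minimal sketch, assuming a recent pandas:

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

many = obj["a"]  # duplicated label: a Series with two entries
one = obj["c"]   # unique label: a single scalar value

print(obj.index.is_unique)
print(type(many).__name__, len(many))
print(one)
```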














Reposted from: https://my.oschina.net/voler/blog/475481
