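pandas Basic Operations (Part 2)

II. Basic Functionality

1. Reindexing

reindex is an important method on pandas objects. It creates a new object whose data conforms to a new index; labels that were not present in the original index get missing values (NaN):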

import numpy as np
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

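For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method argument allows this; for example, ffill forward-fills the values: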

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
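
As a complement to the forward fill above, here is a minimal sketch of the opposite direction using method='bfill'. The variable name back is made up for illustration; the data is the same obj3 as above.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill takes the value from the next existing label instead of the previous one
back = obj3.reindex(range(6), method='bfill')
print(back)
```

Label 5 has no later label to borrow from, so it stays NaN with a backward fill.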

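With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows; the columns can be reindexed with the columns keyword: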


frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

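You can also reindex more concisely with label indexing through loc. This works only if every label already exists in the DataFrame: in pandas 1.0 and later, passing a missing label to loc raises a KeyError instead of inserting NaN:

frame.loc[['a', 'd', 'c'], ['California', 'Texas']]
   California  Texas
a           2      1
d           8      7
c           5      4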
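
How loc treats labels that are missing from the index depends on the pandas version: older releases inserted NaN rows, while recent ones raise KeyError. The sketch below is one way to write code that behaves the same on both, assuming that falling back to reindex is acceptable; the name subset is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
rows = ['a', 'b', 'c', 'd']
states = ['Texas', 'Utah', 'California']

try:
    # Raises KeyError on recent pandas because 'b' and 'Utah' do not exist
    subset = frame.loc[rows, states]
except KeyError:
    # reindex accepts new labels and fills them with NaN
    subset = frame.reindex(index=rows, columns=states)
print(subset)
```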

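Parameters of the reindex method:

Parameter    Description
index        New sequence to use as the index
method       Fill method: ffill fills forward, bfill fills backward
fill_value   Value to use for missing data introduced by reindexing
limit        When filling forward or backward, the maximum gap size (in number of elements) to fill
tolerance    When filling forward or backward, the maximum gap size (in absolute numeric distance) to fill for inexact matches
level        Match a simple Index on a level of a MultiIndex
copy         If True, copy the underlying data even if the new index is equivalent to the old one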
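
Two of these parameters, fill_value and limit, are easy to try on the Series used earlier. This is only a small sketch; the names filled and limited are made up for illustration.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that reindex would otherwise introduce for 'e'
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# limit=1 lets each existing value be carried forward to at most one new label
limited = colors.reindex(range(8), method='ffill', limit=1)

print(filled)
print(limited)
```

With limit=1, label 5 is filled from label 4, but labels 6 and 7 stay NaN because they are more than one step past the last existing value.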

2. Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values removed from an axis:

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
obj.drop(['d', 'c'])
a    0.0
b    1.0
e    4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Calling drop with a sequence of labels removes values by row label (axis 0):

data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

You can drop values from the columns by passing axis=1 or axis='columns':

data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data.drop(['two', 'four'], axis='columns')
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

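By default, drop returns a new object and leaves the original unchanged. Many functions like drop, which modify the size or shape of a Series or DataFrame, can instead operate on the original object in place when passed inplace=True, without returning a new object. Be careful with inplace, because the dropped data is gone: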

obj.drop('c', inplace=True)
obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
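
The difference between the two calling styles can be checked directly. This is a small sketch using the same Series as above; trimmed and result are made-up names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new Series comes back and obj still has five entries
trimmed = obj.drop('c')
length_before = len(obj)

# inplace=True: obj itself is modified and the call returns None
result = obj.drop('c', inplace=True)

print(len(trimmed), length_before, len(obj), result)
```

Because the in-place call returns None, writing obj = obj.drop('c', inplace=True) would replace the Series with None, which is a common slip.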

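3. Indexing, Selection, and Filtering

Series indexing works much like NumPy array indexing, except that you can use the index labels of the Series instead of only integers. Slicing with labels also behaves differently from normal Python slicing: the endpoint is included: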

obj['c'] = 2.0
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
obj['b':'c']
b    1.0
c    2.0
dtype: float64
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
e    4.0
dtype: float64
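
To make the endpoint rule concrete, the sketch below puts a label slice next to a position slice on the same Series. It uses loc and iloc to make the intent explicit; by_label and by_position are made-up names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Label slice: both 'b' and 'c' are included
by_label = obj.loc['b':'c']

# Position slice: position 2 is excluded, as in plain Python
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```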
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

Indexing like this has a few special cases. First, you can select rows with a slice or with a Boolean array:

data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three'] > 5]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data < 5] = 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
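
Boolean conditions can also be combined before selecting rows. The sketch below is one way to do it, starting from a fresh copy of data; note that pandas needs & and | with parentheses around each comparison, not the Python keywords and/or. The names mask and selected are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Each comparison yields a Boolean Series; & combines them element by element
mask = (data['three'] > 5) & (data['one'] < 12)
selected = data[mask]
print(selected)
```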

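Selecting data with loc and iloc

loc selects rows and columns by axis label, while iloc selects them by integer position: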

data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four    11
one      8
two      9
Name: Utah, dtype: int64
data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
data.loc[:'Utah', 'two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

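Integer indexes need care. When a Series has an integer index, ser[-1] is treated as a label lookup rather than a position counted from the end, so it fails because there is no label -1. With a non-integer index there is no such ambiguity: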

ser = pd.Series(np.arange(3.))
ser
0    0.0
1    1.0
2    2.0
dtype: float64
ser[-1] 
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-67-7ed1c232b3a2> in <module>()
----> 1 ser[-1]


~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    599         key = com._apply_if_callable(key, self)
    600         try:
--> 601             result = self.index.get_value(self, key)
    602 
    603             if not is_scalar(result):


~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   2475         try:
   2476             return self._engine.get_value(s, k,
-> 2477                                           tz=getattr(series.dtype, 'tz', None))
   2478         except KeyError as e1:
   2479             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()


pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14031)()


pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13975)()


KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
a    0.0
b    1.0
c    2.0
dtype: float64
ser2[-1]
2.0
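
To avoid the ambiguity, position-based access can be spelled out with iloc and label-based access with loc. The sketch below assumes the same two Series as above; newer pandas releases also warn about the positional fallback used in ser2[-1], so iloc is the safer spelling there as well.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

# iloc always counts positions, whatever the index type
last_int_index = ser.iloc[-1]
last_str_index = ser2.iloc[-1]

# loc slices by label (endpoint included); iloc slices by position (endpoint excluded)
by_label = ser.loc[:1]
by_position = ser.iloc[:1]

print(last_int_index, last_str_index, len(by_label), len(by_position))
```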

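Arithmetic methods with fill values

When objects with different indexes are combined, positions that do not overlap produce NaN. Passing fill_value to an arithmetic method substitutes that value for the missing operand: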

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
1 / df1
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
df1.rdiv(1)
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909

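Flexible arithmetic methods (each has a counterpart starting with r that reverses the arguments):

Method                 Description
add, radd              Addition (+)
sub, rsub              Subtraction (-)
div, rdiv              Division (/)
floordiv, rfloordiv    Floor division (//)
mul, rmul              Multiplication (*)
pow, rpow              Exponentiation (**)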
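
The r-prefixed variants simply swap the operands. A minimal sketch with a made-up two-column DataFrame (df, left, and right are names chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, 8.0]})

# sub computes df - 1; rsub computes 1 - df
left = df.sub(1)
right = df.rsub(1)

print(left)
print(right)
```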

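Operations between DataFrame and Series

As a motivating example, consider the difference between a two-dimensional NumPy array and one of its rows: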

arr = np.arange(12.).reshape((3, 4))
arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
arr[0]
array([ 0.,  1.,  2.,  3.])
arr - arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

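The subtraction is performed once for each row. This is called broadcasting. Operations between a DataFrame and a Series work in a similar way: by default the Series index is matched against the DataFrame columns and the operation is broadcast down the rows. Labels found in only one of the two objects produce NaN: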

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
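
To broadcast across the columns instead, matching the Series against the row index, one option is to use an arithmetic method and pass axis='index'. The sketch below assumes the same frame as above; series3 and result are made-up names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column of frame, indexed by the row labels
series3 = frame['d']

# axis='index' matches series3 on the row labels and broadcasts across columns
result = frame.sub(series3, axis='index')
print(result)
```

Every row has column d subtracted from it, so d becomes 0.0 and its neighbours become -1.0 and 1.0.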

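4. Function Application and Mapping

NumPy element-wise functions (ufuncs) work on pandas objects. The apply method of a DataFrame applies a function to each column by default, or to each row with axis='columns'; the function may return a scalar or a Series: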

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.894119 -1.707091  0.843595
Ohio   -0.413837 -0.251321  0.440044
Texas   0.871326  0.828606 -0.521924
Oregon  0.603869  0.154679  2.067279
np.abs(frame)
               b         d         e
Utah    0.894119  1.707091  0.843595
Ohio    0.413837  0.251321  0.440044
Texas   0.871326  0.828606  0.521924
Oregon  0.603869  0.154679  2.067279
f = lambda x: x.max() - x.min()
frame.apply(f)
b    1.765445
d    2.535697
e    2.589203
dtype: float64
frame.apply(f, axis='columns')
Utah      2.550685
Ohio      0.853881
Texas     1.393250
Oregon    1.912600
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
            b         d         e
min -0.894119 -1.707091 -0.521924
max  0.871326  0.828606  2.067279

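Element-wise Python functions can be used too. Suppose you want to compute a formatted string from each floating-point value in frame. You can do this with applymap (deprecated since pandas 2.1 in favor of DataFrame.map):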

format = lambda x: '%.2f' % x
frame.applymap(format)
            b      d      e
Utah    -0.89  -1.71   0.84
Ohio    -0.41  -0.25   0.44
Texas    0.87   0.83  -0.52
Oregon   0.60   0.15   2.07

Series has its own map method for applying an element-wise function:

frame['e'].map(format)
Utah       0.84
Ohio       0.44
Texas     -0.52
Oregon     2.07
Name: e, dtype: object
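
Since the element-wise DataFrame method has been renamed in newer pandas, code that must run on several versions can pick whichever name is available. This is only a sketch under that assumption: fmt is a made-up helper (named to avoid shadowing the built-in format), and the small frame reuses a few of the values printed above.

```python
import pandas as pd

frame = pd.DataFrame([[-0.894119, 0.843595],
                      [0.603869, 2.067279]],
                     columns=['b', 'e'],
                     index=['Utah', 'Oregon'])

def fmt(x):
    """Format a float with two decimal places."""
    return '%.2f' % x

# DataFrame.map exists on newer pandas; older versions only have applymap
if hasattr(pd.DataFrame, 'map'):
    formatted = frame.map(fmt)
else:
    formatted = frame.applymap(fmt)

print(formatted)
```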