Pandas Series 2: Locating Data in a DataFrame

Latest recommended article published on 2023-03-14 13:29:48

柯墨

Original link: https://www.jianshu.com/p/cc1e32c27712

In pandas, specific data in a DataFrame is located mainly through the following indexers:

iloc
loc
iat
at

Broadly, they fall into two groups:

Label-based (loc, at): uses labels, i.e. the row index and the column names; the row index may be non-numeric, such as strings or dates.
Position-based (iloc, iat): uses integer positions.

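The difference between loc and at is that loc can select whole rows or columns, whereas at can only access a single scalar value; iloc and iat differ in the same way. In general, loc and iloc are the more versatile pair, while at and iat are somewhat faster for single-value access.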
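
As a rough illustration of the four indexers side by side, the sketch below builds a small frame modeled on the one used in this article (only three rows and three numeric columns are kept, which is an assumption made for brevity) and reads the same data by label and by position.

```python
import pandas as pd

# Cut-down frame modeled on the article's data (subset of rows and columns).
df = pd.DataFrame(
    {"age": [30, 2, 12], "height": [165, 70, 120], "score": [4.6, 8.3, 9.0]},
    index=["Jane", "Nick", "Aaron"],
)

# Label-based access: loc can return a row, a column, or a sub-frame;
# at returns exactly one scalar.
row_by_label = df.loc["Nick"]
scalar_by_label = df.at["Nick", "height"]

# Position-based access: the same data addressed by integer position.
row_by_pos = df.iloc[1]
scalar_by_pos = df.iat[1, 1]

print(row_by_label.equals(row_by_pos))
print(scalar_by_label, scalar_by_pos)
```

Both pairs address the same cells; only the way the location is spelled differs.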

For concrete examples, see the StackOverflow example listed in the Reference.
A few cases worth highlighting are shown below:

In [630]: df
Out[630]:
           age  color    food  height  score state
Jane        30   blue   Steak     165    4.6    NY
Nick         2  green    Lamb      70    8.3    TX
Aaron       12    red   Mango     120    9.0    FL
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX
Cornelia    69    red   Beans     150    2.2    TX

# Select a single row
In [631]: df.loc['Dean']
Out[631]:
age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

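# Select a single column. The part before the comma is the row label and the part after it is the column label. ':' means "select all"; here it selects all rows. A ':' after the comma selects all columns, but it can usually be omitted.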
In [241]: df.loc[:, 'color']
Out[241]:
Jane          blue
Nick         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object
# A column can also be selected as follows, but this is not equivalent to the form above; see "Returning a view versus a copy" in the Reference
In [632]: df.loc[:]['color']
Out[632]:
Jane          blue
Nick         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object
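
To make the two remarks above concrete, here is a minimal sketch on a cut-down frame (the three rows and two columns are chosen only for illustration). It checks that a trailing ':' can be dropped when selecting a row, and that both column forms return the same values when only reading.

```python
import pandas as pd

# Cut-down frame modeled on the article's data.
df = pd.DataFrame(
    {"age": [30, 2, 32], "color": ["blue", "green", "gray"]},
    index=["Jane", "Nick", "Dean"],
)

# A trailing ':' meaning "all columns" can be omitted when selecting a row.
same_row = df.loc["Dean", :].equals(df.loc["Dean"])

# When reading, both column forms yield the same values. The second form
# performs two separate indexing steps, which matters for assignment.
same_col = df.loc[:, "color"].equals(df.loc[:]["color"])

print(same_row, same_col)
```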

# Select several rows. Multiple row or column labels must always be wrapped in a list; a single row or column label needs no list
In [634]: df.loc[['Nick', 'Dean']]
Out[634]:
      age  color    food  height  score state
Nick    2  green    Lamb      70    8.3    TX
Dean   32   gray  Cheese     180    1.8    AK

# Note that the following does not work (it raises a KeyError), because pandas treats the value after the comma as a column label
df.loc['Nick', 'Dean']
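
The failure mode can be reproduced with a short sketch, again on a cut-down frame assumed only for illustration. The try/except shows the KeyError, followed by the two spellings that do work.

```python
import pandas as pd

# Cut-down frame modeled on the article's data.
df = pd.DataFrame(
    {"age": [30, 2, 32], "height": [165, 70, 180]},
    index=["Jane", "Nick", "Dean"],
)

# 'Dean' is looked up among the column labels, where it does not exist.
try:
    df.loc["Nick", "Dean"]
    raised = False
except KeyError:
    raised = True

two_rows = df.loc[["Nick", "Dean"]]   # a list of row labels selects two rows
one_cell = df.loc["Nick", "height"]   # row label, then column label

print(raised)
print(two_rows)
print(one_cell)
```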

# Select a range (with loc, both endpoint labels are included)
In [636]: df.loc['Nick':'Christina']
Out[636]:
           age  color    food  height  score state
Nick         2  green    Lamb      70    8.3    TX
Aaron       12    red   Mango     120    9.0    FL
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX
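
One detail worth checking by hand is that label slices and position slices treat the end point differently. The sketch below uses a cut-down frame (an assumption for brevity) to compare the two.

```python
import pandas as pd

# Cut-down frame modeled on the article's data.
df = pd.DataFrame(
    {"age": [30, 2, 12, 4, 32]},
    index=["Jane", "Nick", "Aaron", "Penelope", "Dean"],
)

# loc slices by label and includes the end label.
by_label = df.loc["Nick":"Dean"]

# iloc slices by position and, like ordinary Python slicing, excludes the end.
by_position = df.iloc[1:4]

print(list(by_label.index))
print(list(by_position.index))
```

Here the label slice returns four rows ending at Dean, while the position slice stops one row earlier, at Penelope.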

# Specific to iloc: a negative position such as -1 selects the last row
In [637]: df.iloc[[1, -1]]
Out[637]:
          age  color   food  height  score state
Nick        2  green   Lamb      70    8.3    TX
Cornelia   69    red  Beans     150    2.2    TX
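
A small sketch of negative positions, on a cut-down frame assumed for illustration. It also uses `Index.get_loc` to translate a label into its integer position, which can help when mixing the two styles.

```python
import pandas as pd

# Cut-down frame modeled on the article's data.
df = pd.DataFrame(
    {"age": [30, 2, 69], "score": [4.6, 8.3, 2.2]},
    index=["Jane", "Nick", "Cornelia"],
)

last_row = df.iloc[-1]        # the last row, as a Series
picked = df.iloc[[1, -1]]     # the second row and the last row

# Translate a row label into its integer position.
pos = df.index.get_loc("Cornelia")

print(last_row.name, pos)
print(picked)
```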

Locating data is the foundation for the filtering, assignment, and transformations that follow, so it is worth mastering.

Also, when locating a single element, loc and at do not behave identically.

# loc supports both of the following forms
In [726]: df.loc['Jane', 'score']
Out[726]: 4.6

In [727]: df.loc['Jane']['score']
Out[727]: 4.6

# at supports only the first form
In [729]: df.at['Nick', 'height']
Out[729]: 70

In [730]: df.at['Nick']['height']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-730-948408df1727> in <module>()
----> 1 df.at['Nick']['height']

~/.pyenv/versions/3.6.4/envs/data_analysis/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1867
   1868         key = self._convert_key(key)
-> 1869         return self.obj._get_value(*key, takeable=self._takeable)
   1870
   1871     def __setitem__(self, key, value):

TypeError: _get_value() missing 1 required positional argument: 'col'

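Two points are worth noting:

When assigning to a single element, prefer at; the performance gain is noticeable.
The two loc forms are not equivalent: df.loc['Jane', 'score'] operates on the original data in a single step, whereas df.loc['Jane']['score'] is chained indexing that may operate on a copy, so an assignment made through it may never reach df. See "Returning a view versus a copy" for details.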
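
The second point is easiest to see with an assignment. The following is a minimal sketch on a cut-down frame with mixed column dtypes (assumed for illustration); the warning filter is there only because pandas may warn about the chained form, depending on the version. No timing is measured here, so the performance claim is not demonstrated by this code.

```python
import warnings

import pandas as pd

# Cut-down frame modeled on the article's data (mixed dtypes on purpose).
df = pd.DataFrame(
    {"color": ["blue", "green"], "score": [4.6, 8.3]},
    index=["Jane", "Nick"],
)

# Single-step assignment writes into df itself.
df.at["Nick", "score"] = 9.9
df.loc["Jane", "color"] = "teal"

# Chained assignment: df.loc['Jane'] first builds a separate row Series
# (the columns have different dtypes), so the write does not reach df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    df.loc["Jane"]["score"] = 0.0

print(df)
```

After running this, Nick's score and Jane's color are updated, while Jane's score keeps its original value.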

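Author: geekpy
Link: https://www.jianshu.com/p/cc1e32c27712
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.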