pandas: the chained indexing problem

Elwin Wong

Last modified: 2022-09-07 11:15:53

First published: 2022-09-06 10:44:11

Article link: https://blog.csdn.net/zhaoyuanh/article/details/126720017

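Contents

Chained indexing
The problem with chained-indexing assignment
How the order of indexing operations matters
The chained-assignment warning option
Summary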

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

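This is a warning you will run into often when using pandas. It means that the code is trying to assign to a copy of a slice of a DataFrame. Warnings do not appear without a reason, so there is a pitfall here, and it is worth following the link in the warning to understand it.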

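Chained indexing

When setting values in a pandas object, take particular care to avoid what is known as chained indexing.

What is chained indexing? It means indexing a DataFrame with [] several times in a row. Under the hood this is a sequence of __getitem__ calls that run one after another, each on the result of the previous one, rather than a single operation on the original DataFrame.

Here is the example from the pandas documentation: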

In [23]: dfmi = pd.DataFrame(
    ...:     [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
    ...:     columns=pd.MultiIndex.from_product([['one', 'two'],['first', 'second']])
    ...: )

Two ways to access the data:

# Chained indexing
In [24]: dfmi['one']['second']
Out[24]:
0    b
1    f
2    j
3    n
Name: second, dtype: object

# Single-step indexing
In [25]: dfmi.loc[:, ('one', 'second')]
Out[25]:
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

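The two approaches return essentially the same result (apart from the name attribute), but the code that runs underneath is quite different.

In the first approach, dfmi['one'] indexes the first level of the columns and returns a DataFrame. Call that DataFrame dfmi_with_one. The following ['second'] then indexes dfmi_with_one (that is, dfmi_with_one['second']) and returns the Series selected by 'second'. In chained indexing, then, each [] is a separate operation that acts only on the result of the previous indexing step and knows nothing about what came before.

The second approach, dfmi.loc[:,('one','second')], passes the nested tuple (slice(None),('one','second')) to __getitem__ and calls it only once. This lets pandas treat the request as a single entity. It is also faster, and it can index both axes at once if needed.

The Series.name of the two results (second in one case, (one, second) in the other) also shows the difference: the first approach runs as separate steps, the second as one operation.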
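The two steps hidden in the chained form can be made visible by naming the intermediate object. The following is a minimal sketch that reuses the dfmi frame from the documentation example; the variable names intermediate, chained, direct and same_values are introduced here only for illustration.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=pd.MultiIndex.from_product([["one", "two"], ["first", "second"]]),
)

# Chained indexing: two separate __getitem__ calls.
intermediate = dfmi["one"]        # a DataFrame with columns 'first' and 'second'
chained = intermediate["second"]  # a Series named 'second'

# Single-step indexing: one call made on the original frame.
direct = dfmi.loc[:, ("one", "second")]  # a Series named ('one', 'second')

# The values agree; only the name records how the Series was obtained.
same_values = chained.tolist() == direct.tolist()
print(type(intermediate).__name__, list(intermediate.columns))
print(chained.name, direct.name, same_values)
```

The intermediate DataFrame no longer carries the 'one' level, which is why the chained result is named only 'second'.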

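The problem with chained-indexing assignment

The issue in the previous section is only a matter of performance, but assigning to the result of chained indexing gives unpredictable results. To see why, look at how the Python interpreter executes the code: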

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

Chained indexing, in contrast, is executed like this:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)

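Note the __getitem__ call in the middle. Outside of very simple cases, it is hard to tell whether this __getitem__ returns a view or a copy (the pandas documentation says this depends on the memory layout of the array, about which pandas makes no guarantees). It is therefore also impossible to tell whether the following __setitem__ modifies dfmi or a temporary object that is thrown away immediately afterwards. That is what the SettingWithCopy warning at the top of this article is about.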
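The effect of that intermediate __getitem__ can be shown without pandas at all. The sketch below uses a toy class, ToyFrame, whose return_view switch is purely hypothetical and exists only for this illustration; it is not how pandas decides between a view and a copy.

```python
class ToyFrame:
    """A toy column store; NOT pandas internals, only an illustration."""

    def __init__(self, columns, return_view):
        self.columns = columns          # dict: column name -> list of values
        self.return_view = return_view  # hypothetical switch: view or copy?

    def __getitem__(self, key):
        column = self.columns[key]
        # A "view" shares the underlying list; a "copy" is an independent list.
        return column if self.return_view else list(column)


def chained_assign(frame):
    # frame['c'][0] = 42 is executed as
    # frame.__getitem__('c').__setitem__(0, 42)
    frame["c"][0] = 42
    return frame.columns["c"]


with_view = chained_assign(ToyFrame({"c": [0, 1, 2]}, return_view=True))
with_copy = chained_assign(ToyFrame({"c": [0, 1, 2]}, return_view=False))

print(with_view)  # [42, 1, 2]: the write reached the original data
print(with_copy)  # [0, 1, 2]: the write went to a discarded temporary
```

The assignment statement is identical in both cases; only the object returned by __getitem__ decides whether the original data changes.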

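For the loc form, note the loc attribute in front of __setitem__. pandas guarantees that dfmi.loc is dfmi itself, so dfmi.loc.__getitem__ and dfmi.loc.__setitem__ operate on dfmi directly. The result of dfmi.loc.__getitem__(idx), of course, may still be either a view or a copy of dfmi.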

Let's look at what these two operations actually do:

Assigning with loc

In [27]: dfmi.loc[:, ('one', 'second')] = list('1234')

In [28]: dfmi
Out[28]:
    one          two
  first second first second
0     a      1     c      d
1     e      2     g      h
2     i      3     k      l
3     m      4     o      p

The assignment succeeds.

Assigning with chained indexing

In [29]: dfmi['one']['second'] = list('5678')
<ipython-input-29-7370041e44f2>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfmi['one']['second'] = list('5678')

In [30]: dfmi
Out[30]:
    one          two
  first second first second
0     a      1     c      d
1     e      2     g      h
2     i      3     k      l
3     m      4     o      p

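A SettingWithCopyWarning is emitted and the assignment has no effect: dfmi is not modified.

Chaining loc calls triggers the same warning. The reason was given above: df.loc.__getitem__(idx) may be a view or a copy of df, so its behaviour is just as unpredictable. Avoid code like this: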

In [31]: dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')
<ipython-input-16-791a61a3bb59>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')

# dfmi did change here, but the behaviour is still unpredictable; avoid chained indexing with loc
In [32]: dfmi
Out[32]:
    one          two
  first second first second
0     a      5     c      d
1     e      6     g      h
2     i      7     k      l
3     m      8     o      p

Sometimes a SettingWithCopy warning appears even though there is no obvious chained indexing. The following code from the pandas documentation is such a case:

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo

Another example:

In [33]: dfsi = pd.DataFrame(
   ...:     [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
   ...:     columns=['one', 'two', 'first', 'second']
   ...: )

In [34]: onetwo = dfsi[['one', 'two']]

In [35]: onetwo['one'] = list('1234')
<ipython-input-5-81f0fc384f1d>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  onetwo['one'] = list('1234')

# dfsi is unchanged, so the indexing of dfsi above returned a copy
In [36]: dfsi
Out[36]:
  one two first second
0   a   b     c      d
1   e   f     g      h
2   i   j     k      l
3   m   n     o      p

In [37]: onetwo
Out[37]:
  one two
0   1   b
1   2   f
2   3   j
3   4   n

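This is simply chained-indexing assignment split across several lines of code, so it is the same problem at heart; pandas tries to detect such cases and warn about them. When this warning appears, check whether the code contains a chained-indexing assignment. Because the behaviour is unpredictable and the assignment may not take effect, use loc instead, unless you are sure that chained indexing is what you want.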
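For the split-across-lines case, one commonly used way to remove the ambiguity is to state the intent explicitly: call .copy() when an independent sub-frame is wanted, or make a single .loc call on the original frame when the original should change. The following is a minimal sketch of both, reusing dfsi from the example above; the name copy_warnings and the check on the warning's class name are assumptions made for this illustration.

```python
import warnings

import pandas as pd

dfsi = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=["one", "two", "first", "second"],
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Intent 1: an independent sub-frame. The explicit copy() states that
    # changes to onetwo are not meant to reach dfsi.
    onetwo = dfsi[["one", "two"]].copy()
    onetwo["one"] = list("1234")

    # Intent 2: modify the original frame itself, in a single .loc call.
    dfsi.loc[:, "two"] = list("5678")

# Collect only copy-related warnings, matched by class name.
copy_warnings = [
    w for w in caught if w.category.__name__ == "SettingWithCopyWarning"
]

print(onetwo["one"].tolist())  # ['1', '2', '3', '4']
print(dfsi["one"].tolist())    # ['a', 'e', 'i', 'm']
print(dfsi["two"].tolist())    # ['5', '6', '7', '8']
print(len(copy_warnings))      # 0
```

Either way, a reader of the code no longer has to guess whether onetwo shares data with dfsi.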

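How the order of indexing operations matters

With chained indexing, the type of each indexer and the order of the indexing operations affect whether the result is a slice of the original object or a copy of a slice: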

In [38]: dfa = pd.DataFrame(
    ...:     {'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    ...:      'c': np.arange(7)}
    ...: )

In [39]: dfb = dfa.copy()

# This will show the SettingWithCopyWarning
# but the frame values will be set
In [40]: dfb['c'][dfb['a'].str.startswith('o')] = 42
<ipython-input-25-57ce4ff20dfc>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfb['c'][dfb['a'].str.startswith('o')] = 42

In [41]: dfb
Out[41]:
       a   c
0    one  42
1    one  42
2    two   2
3  three   3
4    two   4
5    one  42
6    six   6

In [42]: dfb = dfa.copy()

# This however is operating on a copy and will not work
In [43]: dfb[dfb['a'].str.startswith('o')]['c'] = 42
<ipython-input-29-216d8bd475bb>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfb[dfb['a'].str.startswith('o')]['c'] = 42

In [44]: dfb
Out[44]:
       a  c
0    one  0
1    one  1
2    two  2
3  three  3
4    two  4
5    one  5
6    six  6

For the scenario above, the pandas documentation recommends the following .loc-based access:

In [45]: dfb = dfa.copy()

# Setting multiple items using a mask
In [46]: mask = dfb['a'].str.startswith('o')

In [47]: dfb.loc[mask, 'c'] = 42

In [48]: dfb
Out[48]:
       a   c
0    one  42
1    one  42
2    two   2
3  three   3
4    two   4
5    one  42
6    six   6

# Setting a single item
In [49]: dfb = dfa.copy()

In [50]: dfb.loc[2, 'a'] = 11

In [51]: dfb
Out[51]:
       a  c
0    one  0
1    one  1
2     11  2
3  three  3
4    two  4
5    one  5
6    six  6

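The chained-assignment warning option

pandas provides an option, mode.chained_assignment, that controls how chained-assignment problems are reported. It accepts three values:

warn: emit a warning; this is the default and prints a SettingWithCopyWarning
raise: raise a SettingWithCopyError, so the chained indexing has to be fixed
None: ignore chained-assignment problems; neither warn nor raise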

In [52]: pd.set_option('mode.chained_assignment','raise')

In [53]: dfb[dfb['a'].str.startswith('o')]['c'] = 42
---------------------------------------------------------------------------
SettingWithCopyError                      Traceback (most recent call last)
...
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

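Summary

Assigning through chained indexing behaves unpredictably. Avoid chained indexing and use .loc[row_indexer,col_indexer] = value instead.

Examples of chained-indexing assignment: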

dfmi['one']['second'] = list('5678')
dfmi.loc[:, 'one'].loc[:, 'second'] = list('5678')
dfb['c'][dfb['a'].str.startswith('o')] = 42
dfb[dfb['a'].str.startswith('o')]['c'] = 42
dfb['a'][2] = 111
dfb.loc[0]['a'] = 1111

onetwo = dfsi[['one', 'two']]
onetwo['one'] = list('1234')

...

Use .loc instead:

dfmi.loc[:, ('one', 'second')] = list('1234')
dfb.loc[dfb['a'].str.startswith('o'), 'c'] = 42
dfb.loc[2, 'a'] = 111
dfb.loc[0, 'a'] = 1111
