SettingWithCopyWarning in Pandas

SettingWithCopyWarning: How to Fix This Warning in Pandas

Data source: the "Xbox 3 day" dataset from https://www.modelingonlineauctions.com/datasets

Some operations in pandas return their result in one of two forms:

  • return a view
  • return a copy

[Figure: a view versus a copy]

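As the figure above shows, df2 on the left is only a view of df1, whereas the copy on the right produces an independent DataFrame object, df2.

[Figure: modifying.png (image not available)]

Depending on what we need, we may want to modify the original df1 (left), or we may want to operate only on df2 (right). The warning exists to tell us that our operation may not be doing what we intended.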

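Common problem 1: Chained assignment

Pandas generates the warning when it detects chained assignment. Here are definitions of a few terms that will be used below:

  • Assignment: an operation that sets a value (a "set"), for example data = pd.read_csv('xbox-3-day-auctions.csv').

  • Access: an operation that returns a value (a "get").

  • Indexing: any assignment or access method that references a subset of the data, for example data[1:5].

  • Chaining: using more than one indexing operation back-to-back, for example data[1:5][1:3].

Chained assignment is the combination of chaining and assignment.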
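To make these terms concrete, here is a minimal sketch on a small made-up DataFrame. The column values are illustrative assumptions and are not taken from the auction dataset; only the column names follow it.

```python
import pandas as pd

# Hypothetical toy data standing in for the auction dataset
data = pd.DataFrame({"bidder": ["a", "b", "c", "d", "e", "f"],
                     "bid": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Access (get) through one indexing operation: rows at positions 1..4
subset = data[1:5]

# Chaining: a second indexing operation applied to the result of the first
chained = data[1:5][1:3]

# Assignment (set) through one indexing operation
data.loc[0, "bid"] = 10.0

print(len(subset))          # 4
print(list(chained.index))  # [2, 3]
print(data.loc[0, "bid"])   # 10.0
```

Chaining on its own, as in the second step, is harmless when we only read. The trouble starts when the last step of a chain is an assignment.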

Example: update the bidderrate values for the rows where bidder == 'parakeet2004'

  1. Print the current values
data[data.bidder == 'parakeet2004']
    auctionid    bid   bidtime        bidder  bidderrate  openbid  price
6  8213060420   3.00  0.186539  parakeet2004           5      1.0  120.0
7  8213060420  10.00  0.186690  parakeet2004           5      1.0  120.0
8  8213060420  24.99  0.187049  parakeet2004           5      1.0  120.0
  2. Update the values in the bidderrate column
data[data.bidder == 'parakeet2004']['bidderrate'] = 100
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

Pandas reports a SettingWithCopyWarning!

  3. If we inspect the data again, we find that the contents of the original DataFrame have not changed at all
data[data.bidder == 'parakeet2004']
    auctionid    bid   bidtime        bidder  bidderrate  openbid  price
6  8213060420   3.00  0.186539  parakeet2004           5      1.0  120.0
7  8213060420  10.00  0.186690  parakeet2004           5      1.0  120.0
8  8213060420  24.99  0.187049  parakeet2004           5      1.0  120.0

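Why the warning appears: two indexing operations were chained together

It is easy to see that we used square brackets twice, but the same thing happens with other access methods such as .bidderrate, .loc[], .iloc[], .ix[] and so on. Our chained operations are: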

  • data[data.bidder == 'parakeet2004']

  • ['bidderrate'] = 100

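These two chained operations are executed independently, one after the other.

The first is an access (get) operation that returns a DataFrame containing the rows where bidder == 'parakeet2004'.
The second is an assignment (set) operation that is applied to this new DataFrame rather than to the original DataFrame.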
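The two independent steps can be reproduced on a toy DataFrame. This is a minimal sketch with made-up values (only the column names follow the auction dataset). The warning is captured with the standard warnings module so that its class can be printed, since the exact class reported is assumed to vary between pandas versions.

```python
import warnings
import pandas as pd

# Hypothetical toy data; column names mirror the auction dataset
data = pd.DataFrame({"bidder": ["parakeet2004", "other", "parakeet2004"],
                     "bidderrate": [5, 7, 5]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Step 1 (get): boolean indexing builds a new, temporary DataFrame
    # Step 2 (set): the assignment lands on that temporary object
    data[data.bidder == "parakeet2004"]["bidderrate"] = 100

# The original DataFrame is untouched
print(data["bidderrate"].tolist())  # [5, 7, 5]

# The warning class reported depends on the pandas version in use
print([w.category.__name__ for w in caught])
```

Because nothing holds a reference to the temporary DataFrame, the value 100 is simply lost.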

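Solution: use loc to combine the chained operations into a single operation

This lets pandas make sure that the original DataFrame is the one being set. Pandas always ensures that non-chained set operations like the one below work.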

# Setting the new value
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100
# Taking a look at the result
data[data.bidder == 'parakeet2004']['bidderrate']
6 100
7 100
8 100
Name: bidderrate, dtype: int64

With this approach the warning goes away.
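The same fix can be checked on a small self-contained example. This is a sketch with made-up values, not the auction data itself.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the auction dataset
data = pd.DataFrame({"bidder": ["parakeet2004", "other", "parakeet2004"],
                     "bidderrate": [5, 7, 5]})

# One call: the row selector and the column label are passed together
# to a single .loc assignment, so no intermediate object is involved
data.loc[data.bidder == "parakeet2004", "bidderrate"] = 100

print(data["bidderrate"].tolist())  # [100, 7, 100]
```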

Common problem 2: Hidden chaining

Having learned our lesson about chained assignment, we take care to use loc this time.

winners = data.loc[data.bid == data.price]
winners.head()
     auctionid    bid   bidtime            bidder  bidderrate  openbid  price
3   8213034705  117.5  2.998947           daysrus          10    95.00  117.5
25  8213060420  120.0  2.999722  djnoeproductions          17     1.00  120.0
44  8213067838  132.5  2.996632  champaignbubbles         202    29.99  132.5
45  8213067838  132.5  2.997789  champaignbubbles         202    29.99  132.5
66  8213073509  114.5  2.999236           rr6kids           4     1.00  114.5

Now we try to assign a value to winners.loc[304, 'bidder'], because that value is missing in the original data.

winners.loc[304, 'bidder']
nan
winners.loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

As we can see, a SettingWithCopyWarning still appears even though we used loc.

Why the warning appears

winners was created by a get operation (data.loc[data.bid == data.price]), so it may or may not be a copy of the original DataFrame, and we cannot know which until we check. When we index winners, we are in effect using chained indexing.

In other words, when we try to modify winners, we may also be modifying the original data at the same time.

Solution

To prevent the SettingWithCopyWarning in this situation, explicitly tell pandas to make a copy when creating the new DataFrame.

winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(winners.loc[304, 'bidder'])
print(data.loc[304, 'bidder'])
therealname
nan
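For readers without the dataset at hand, the copy-based fix can be tried on a toy frame. This is a minimal sketch: the three rows and their index labels are invented so that row 304 is a "winner" with a missing bidder.

```python
import pandas as pd

# Hypothetical rows; the labels 300/302/304 are chosen for illustration
data = pd.DataFrame({"bid": [10.0, 20.0, 30.0],
                     "price": [10.0, 25.0, 30.0],
                     "bidder": ["x", "y", None]},
                    index=[300, 302, 304])

# .copy() makes winners an independent DataFrame
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, "bidder"] = "therealname"

print(winners.loc[304, "bidder"])      # therealname
print(pd.isna(data.loc[304, "bidder"]))  # True: the original is still missing
```

The explicit copy removes the ambiguity: only winners changes, and data keeps its missing value.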

Tips

The trick is to learn to identify chained indexing and avoid it at all costs.

If you want to change the original, use a single assignment operation.
If you want a copy, make sure you force pandas to do just that.

This will save time and make your code water-tight.

A deeper look at chained assignment

Let’s reuse our earlier example where we were trying to update the bidderrate column for each row in data with a bidder value of 'parakeet2004'.

data[data.bidder == 'parakeet2004']['bidderrate'] = 100
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

What pandas is really telling us with this SettingWithCopyWarning is that the behavior of our code is ambiguous, but to understand why this is and the wording of the warning, it will be helpful to go over a few concepts.

We talked briefly about views and copies earlier. There are two possible ways to access a subset of a DataFrame: either one could create a reference to the original data in memory (a view) or copy the subset into a new, smaller DataFrame (a copy). A view is a way of looking at a particular portion of the original data, whereas a copy is a clone of that data to a new location in memory. As our diagram earlier showed, modifying a view will modify the original variable but modifying a copy will not.

For reasons that we will get into later, the output of ‘get’ operations in pandas is not guaranteed. Either a view or a copy could be returned when you index a pandas data structure, which means get operations on a DataFrame return a new DataFrame that can contain either:

  • A copy of data from the original object.
  • A reference to the original object’s data without making a copy.

Because we don’t know what will happen and each possibility has very different behavior, ignoring the warning is playing with fire.

To illustrate views, copies and this ambiguity more clearly, let’s create a simple DataFrame and index it:

df1 = pd.DataFrame(np.arange(6).reshape((3,2)), columns=list('AB'))
df1
   A  B
0  0  1
1  2  3
2  4  5

And let’s assign a subset of df1 to df2:

df2 = df1.loc[:1]
df2
   A  B
0  0  1
1  2  3

Given what we have learned, we know that df2 could be a view on df1 or a copy of a subset of df1.

Before we can get to grips with our problem, we also need to take another look at chained indexing. Expanding on our example with 'parakeet2004', we have chained together two indexing operations:

data[data.bidder == 'parakeet2004']
__intermediate__['bidderrate'] = 100

Where __intermediate__ represents the output of the first call and is completely hidden from us. Remember that we would get the same problematic outcome if we had used attribute access:

data[data.bidder == 'parakeet2004'].bidderrate = 100

The same applies to any other form of chained call because we are generating this intermediate object.

Under the hood, chained indexing means making more than one call to __getitem__ or __setitem__ to accomplish a single operation. These are special Python methods that are invoked by the use of square brackets on an instance of a class that implements them, an example of what is called syntactic sugar. Let’s look at what the Python interpreter will execute in our example.

# Our code
data[data.bidder == 'parakeet2004']['bidderrate'] = 100
# Code executed
data.__getitem__(data.__getitem__('bidder') == 'parakeet2004').__setitem__('bidderrate', 100)

As you may have realized already, SettingWithCopyWarning is generated as a result of this chained __setitem__ call. You can try this for yourself – the lines above function identically. For clarity, note that the second __getitem__ call (for the bidder column) is nested and not at all part of the chaining problem here.
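The call sequence can be observed without pandas at all. The sketch below uses a hypothetical Traced class (not part of pandas) whose only job is to log which special method each pair of square brackets triggers.

```python
class Traced:
    """Hypothetical stand-in for a DataFrame that logs indexing calls."""
    calls = []  # log shared by all instances

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        Traced.calls.append(f"{self.name}.__getitem__({key!r})")
        # A get operation hands back a new, separate object
        return Traced("temp")

    def __setitem__(self, key, value):
        Traced.calls.append(f"{self.name}.__setitem__({key!r}, {value!r})")


data = Traced("data")
data["rows"]["bidderrate"] = 100  # chained assignment

print(Traced.calls)
# ["data.__getitem__('rows')", "temp.__setitem__('bidderrate', 100)"]
```

The log shows that __setitem__ is called on the temporary object returned by __getitem__, never on data itself.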

In general, as discussed, pandas does not guarantee whether a get operation will return a view or a copy of the data. If a view is returned in our example, the second expression in our chained assignment will be a call to __setitem__ on the original object. But, if a copy is returned, it’s the copy that will be modified instead – the original object does not get modified.

This is what the warning means by “a value is trying to be set on a copy of a slice from a DataFrame”. As there are no references to this copy, it will ultimately be garbage collected. The SettingWithCopyWarning is letting us know that pandas cannot determine whether a view or a copy was returned by the first __getitem__ call, and so it’s unclear whether the assignment changed the original object or not. Put another way, pandas gives us this warning because the answer to the question “are we modifying the original?” is unknown.

We do want to modify the original, and the solution that the warning suggests is to convert these two separate, chained operations into a single assignment operation using loc. This will remove chained indexing from our code and we will no longer receive the warning. Our fixed code and its expanded version will look like this:

# Our code
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100
# Code executed
data.loc.__setitem__((data.__getitem__('bidder') == 'parakeet2004', 'bidderrate'), 100)

Our DataFrame’s loc property is guaranteed to be the original DataFrame itself but with expanded indexing capabilities.

False negatives

Using loc doesn’t end our problems because get operations with loc can still return either a view or a copy. Let’s quickly examine a somewhat convoluted example.

data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
   bidderrate    bid
6         100   3.00
7         100  10.00
8         100  24.99

We’ve pulled two columns out this time rather than just the one. Let’s try to set all the bid values.

data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]['bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
   bidderrate    bid
6         100   3.00
7         100  10.00
8         100  24.99

No effect and no warning! We have set a value on a copy of a slice but it was not detected by pandas – this is a false negative. Just because we have used loc doesn’t mean we can start using chained assignment again. There is an old, unresolved issue on GitHub for this particular bug.

The correct way to do this is as follows:

data.loc[data.bidder == 'parakeet2004', 'bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
   bidderrate  bid
6         100  5.0
7         100  5.0
8         100  5.0

You might wonder how someone could possibly end up with such a problem in practice, but it’s easier than you might expect when assigning the results of DataFrame queries to variables as we do in the next section.

Hidden chaining

Let’s look again at our hidden chaining example from earlier, where we were trying to set the bidder value from the row labelled 304 in our winners variable.

winners = data.loc[data.bid == data.price]
winners.loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

We get another SettingWithCopyWarning even though we used loc. This problem can be incredibly confusing as the warning message appears to be suggesting that we do what we have already done.

But think about the winners variable. What really is it? Given that we instantiated it via data.loc[data.bid == data.price], we cannot know whether it’s a view or a copy of our original data DataFrame (because get operations return either a view or a copy). Combining the instantiation with the line that generated the warning makes clear our mistake.

data.loc[data.bid == data.price].loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

We used chained assignment again, but this time it was broken across two lines. Another way to think about this is to ask the question “does this modify one or two things?” In our case, the answer is unknown: if winners is a copy then only winners is affected but if it’s a view both winners and data will show updated values. This situation can occur between lines that are very far apart within a script or codebase, making the source of the problem potentially very difficult to track down.

The intention of the warning here is to prevent us from thinking our code will modify the original DataFrame when it won’t, or that we’re modifying a copy rather than the original. Delving into old issues on pandas’ GitHub repo, you can read the devs explaining this themselves.

If you prefer to work on a copy

How we resolve this problem depends very much on our own intentions. If we are happy to work with a copy of our original data, the solution is simply to force pandas to make a copy.

winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(data.loc[304, 'bidder']) # Original
print(winners.loc[304, 'bidder']) # Copy
nan
therealname

If you prefer to work on the original DataFrame: boolean indexing with a mask

If, on the other hand, you require that the original DataFrame is updated then you should work with the original DataFrame instead of instantiating other variables with unknown behavior. Our prior code would become:

# Finding the winners
winner_mask = data.bid == data.price
# Taking a peek
data.loc[winner_mask].head()
# Doing analysis
mean_win_time = data.loc[winner_mask, 'bidtime'].mean()
... # 20 lines of code
mode_open_bid = data.loc[winner_mask, 'openbid'].mode()
# Updating the username
data.loc[304, 'bidder'] = 'therealname'

In more complex circumstances, such as modifying a subset of a subset of a DataFrame, instead of using chained indexing one can modify the slices one is making via loc on the original DataFrame. For example, you could change our new winner_mask variable above or create a new variable that selected a subset of winners, like so:

high_winner_mask = winner_mask & (data.price > 150)
data.loc[high_winner_mask].head()
      auctionid    bid   bidtime             bidder  bidderrate  openbid  price  bidtime_hours
225  8213387444  152.0  2.919757  uconnbabydoll1975          15     0.99  152.0      70.074168
328  8213935134  207.5  2.983542           toby2492           0     0.10  207.5      71.605008
416  8214430396  199.0  2.990463        volpendesta           4     9.99  199.0      71.771112
531  8215582227  152.5  2.999664      ultimatum_man           2    60.00  152.5      71.991936

This technique is more robust to future codebase maintenance and scaling.
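A compact version of the mask technique is sketched below on invented data. The flag column is a hypothetical addition used only to make the effect of the assignment visible; it does not exist in the auction dataset.

```python
import pandas as pd

# Hypothetical toy data; "flag" is an illustrative column
data = pd.DataFrame({"bid": [100.0, 160.0, 200.0, 90.0],
                     "price": [100.0, 160.0, 210.0, 90.0],
                     "flag": ["", "", "", ""]})

# Masks are plain boolean Series, so they can be stored and combined
winner_mask = data.bid == data.price
high_winner_mask = winner_mask & (data.price > 150)

# Every write goes through .loc on the original DataFrame
data.loc[high_winner_mask, "flag"] = "high"

print(winner_mask.tolist())   # [True, True, False, True]
print(data["flag"].tolist())  # ['', 'high', '', '']
```

Since no intermediate DataFrame is ever assigned to a variable, there is nothing whose view-or-copy status could be in doubt.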


History

You might be wondering why the whole SettingWithCopy problem can’t simply be avoided entirely by explicitly specifying indexing methods that return either a view or a copy rather than creating the confusing situation we find ourselves in. To understand this, we must look into pandas’ past.

The logic pandas uses to determine whether it returns a view or a copy stems from its use of the NumPy library, which underlies pandas’ operation. Views actually entered the pandas lexicon via NumPy. Indeed, views are useful in NumPy because they are returned predictably. Because NumPy arrays are single-typed, pandas attempts to minimize space and processing requirements by using the most appropriate dtype. As a result, slices of a DataFrame that contain a single dtype can be returned as a view on a single NumPy array, which is a highly efficient way to handle the operation. However, multi-dtype slices can’t be stored in the same way in NumPy so efficiently. Pandas juggles versatile indexing functionality with the ability to use its NumPy core most effectively.
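The predictable NumPy behaviour mentioned above can be checked directly. This sketch uses only NumPy and np.shares_memory to show which kinds of indexing give a view and which give a copy. It illustrates NumPy's rules only and is not a description of how pandas decides internally.

```python
import numpy as np

arr = np.arange(6)

basic = arr[1:4]        # basic slicing returns a view
fancy = arr[[1, 2, 3]]  # integer-array indexing returns a copy
masked = arr[arr > 2]   # boolean indexing returns a copy

print(np.shares_memory(arr, basic))   # True
print(np.shares_memory(arr, fancy))   # False
print(np.shares_memory(arr, masked))  # False

# Writing through the view changes the original array
basic[0] = 99
print(arr[1])  # 99
```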

Ultimately, indexing in pandas was designed to be useful and versatile in a way that doesn’t exactly marry the functionality of the underlying NumPy arrays at its core. The interaction between these elements of design and function over time has led to a complex set of rules that determine whether or not a view or a copy can be returned. Experienced pandas developers are generally happy with pandas’ behaviors because they are comfortable navigating its indexing behaviors.

Unfortunately for newcomers to the library, chained indexing is almost unavoidable despite not being the intended approach simply because get operations return indexable pandas objects. Furthermore, in the words of Jeff Reback, one of the core developers of pandas for several years, “It’s simply not possible from a language perspective to detect chain indexing directly; it has to be inferred”.

Consequently, the warning was introduced in version 0.13.0 near the end of 2013 as a solution to the silent failure of chained assignment encountered by many developers.

Prior to version 0.12, the ix indexer was the most popular (in the pandas nomenclature, “indexers” such as ix, loc and iloc are simply constructs that allow objects to be indexed with square brackets just like arrays, but with special behavior). But it was around this time, in mid-2013, that the pandas project was beginning to gain momentum and catering to novice users was of rising importance. Since this release the loc and iloc indexers have consequently been preferred for their more explicit nature and easier to interpret usages.

[Figure: Google Trends: pandas]

The SettingWithCopyWarning has continued to evolve after its introduction, was hotly discussed in many GitHub issues for several years, and is even still being updated, but it’s here to stay and understanding it remains crucial to becoming a pandas expert.

Wrapping up

The complexity underlying the SettingWithCopyWarning is one of the few rough edges in the pandas library. Its roots are very deeply embedded in the library and should not be ignored. In Jeff Reback’s own words there “are no cases that I am aware [of] that you should actually ignore this warning. … If you do certain types of indexing it will never work, others it will work. You are really playing with fire.”

Fortunately, addressing the warning only requires you to identify chained assignment and fix it. If there’s just one thing to take away from all this, it’s that.
