SettingWithCopyWarning in Pandas
SettingWithCopyWarning: How to Fix This Warning in Pandas
Data source: the Xbox 3-day dataset from https://www.modelingonlineauctions.com/datasets
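Some pandas operations return their result in one of two forms:
- return a View
- return a Copy
In the accompanying diagram, df2 on the left is only a view of df1, whereas the copy on the right produces an independent DataFrame object, df2.
Depending on what we need, we may want to modify the original df1 (left), or we may want to operate only on df2 (right). The warning exists to tell us that our operation may not be doing what we intended.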
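Common problem 1: Chained assignment
Pandas generates the warning when it detects chained assignment. A few terms that are used below:
- Assignment: an operation that sets a value (a set), for example data = pd.read_csv('xbox-3-day-auctions.csv').
- Access: an operation that returns a value (a get).
- Indexing: any assignment or access method that references a subset of the data, for example data[1:5].
- Chaining: using more than one indexing operation back-to-back, for example data[1:5][1:3].
Chained assignment is the combination of chaining and assignment.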
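To make the terms above concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration, not the auction dataset). It shows a single indexing operation next to a chained one.

```python
import pandas as pd

# Hypothetical toy data standing in for the auction dataset
df = pd.DataFrame({
    "bidder": list("abcdef"),
    "bid": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Indexing (an access / get): rows at positions 1 to 4
subset = df[1:5]

# Chaining: a second indexing operation applied to the result of the first
chained = df[1:5][1:3]

print(subset.index.tolist())   # [1, 2, 3, 4]
print(chained.index.tolist())  # [2, 3]
```

Chaining two gets like this is harmless; the trouble only starts when the last step of the chain is an assignment.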
Example: update the rows where bidder == 'parakeet2004'.
- Print the current values
data[data.bidder == 'parakeet2004']
index | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
---|---|---|---|---|---|---|---|
6 | 8213060420 | 3.00 | 0.186539 | parakeet2004 | 5 | 1.0 | 120.0 |
7 | 8213060420 | 10.00 | 0.186690 | parakeet2004 | 5 | 1.0 | 120.0 |
8 | 8213060420 | 24.99 | 0.187049 | parakeet2004 | 5 | 1.0 | 120.0 |
- Update the bidderrate field
data[data.bidder == 'parakeet2004']['bidderrate'] = 100
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
This triggers a SettingWithCopyWarning!
- If we check, we find that the contents of the original DataFrame have not changed at all
data[data.bidder == 'parakeet2004']
index | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
---|---|---|---|---|---|---|---|
6 | 8213060420 | 3.00 | 0.186539 | parakeet2004 | 5 | 1.0 | 120.0 |
7 | 8213060420 | 10.00 | 0.186690 | parakeet2004 | 5 | 1.0 | 120.0 |
8 | 8213060420 | 24.99 | 0.187049 | parakeet2004 | 5 | 1.0 | 120.0 |
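Why the warning appears: two indexing operations were chained together
It is easy to see that we used square brackets twice, but the same thing happens with other access methods such as .bidderrate, .loc[], .iloc[] and .ix[] (the last of which was removed in pandas 1.0). Our chained operations are:
- data[data.bidder == 'parakeet2004']
- ['bidderrate'] = 100
These two chained operations are executed independently, one after the other.
The first is an access (get) operation that returns a DataFrame containing the rows where bidder == 'parakeet2004'.
The second is an assignment (set) operation, which is applied to that new DataFrame rather than to the original DataFrame.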
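Solution: use loc to combine the chained operations into a single operation
This lets pandas make sure that the original DataFrame is the one being set. Pandas always ensures that unchained set operations, like the one below, work.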
# Setting the new value
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100
# Taking a look at the result
data[data.bidder == 'parakeet2004']['bidderrate']
6 100
7 100
8 100
Name: bidderrate, dtype: int64
With this approach the warning goes away.
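For readers without the CSV file at hand, the same fix can be sketched on a small made-up frame (the rows below are assumptions for illustration only). The point is that a single loc call carries both the row selector and the column label.

```python
import pandas as pd

# Hypothetical stand-in for the auction data
data = pd.DataFrame({
    "bidder": ["parakeet2004", "someone_else", "parakeet2004"],
    "bidderrate": [5, 12, 5],
})

# One set operation: row mask and column label in the same loc call
mask = data["bidder"] == "parakeet2004"
data.loc[mask, "bidderrate"] = 100

print(data["bidderrate"].tolist())  # [100, 12, 100]
```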
Common problem 2: Hidden chaining
Chained assignment can also be hidden. Note that this time we do use loc:
winners = data.loc[data.bid == data.price]
winners.head()
index | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
---|---|---|---|---|---|---|---|
3 | 8213034705 | 117.5 | 2.998947 | daysrus | 10 | 95.00 | 117.5 |
25 | 8213060420 | 120.0 | 2.999722 | djnoeproductions | 17 | 1.00 | 120.0 |
44 | 8213067838 | 132.5 | 2.996632 | champaignbubbles | 202 | 29.99 | 132.5 |
45 | 8213067838 | 132.5 | 2.997789 | champaignbubbles | 202 | 29.99 | 132.5 |
66 | 8213073509 | 114.5 | 2.999236 | rr6kids | 4 | 1.00 | 114.5 |
Let's try to assign a value to winners.loc[304, 'bidder'], because the original data has no value there:
winners.loc[304, 'bidder']
nan
winners.loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
As we can see, SettingWithCopyWarning still appears even though we used loc!
Why the warning appears
winners was created by a get operation (data.loc[data.bid == data.price]). It might be a copy of the original DataFrame, or it might not be, and we cannot know which without checking. When we index winners, we are therefore effectively using chained indexing.
In other words, when we try to modify winners, we may also be modifying the original data.
Solution
To prevent SettingWithCopyWarning in this situation, explicitly tell pandas to make a copy when creating the new DataFrame.
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(winners.loc[304, 'bidder'])
print(data.loc[304, 'bidder'])
therealname
nan
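The same pattern can be tried without the dataset. The sketch below uses a tiny made-up frame (column values and the row label 2 are assumptions) and checks that only the explicit copy changes.

```python
import pandas as pd

# Hypothetical toy data: the row labelled 2 is a winning bid with a missing bidder
data = pd.DataFrame({
    "bid": [10.0, 20.0, 30.0],
    "price": [10.0, 25.0, 30.0],
    "bidder": ["a", "b", None],
})

# Explicit copy: winners is now independent of data
winners = data.loc[data["bid"] == data["price"]].copy()
winners.loc[2, "bidder"] = "therealname"

print(winners.loc[2, "bidder"])        # therealname
print(pd.isna(data.loc[2, "bidder"]))  # True: the original is untouched
```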
Tips
The trick is to learn to identify chained indexing and avoid it at all costs.
If you want to change the original, use a single assignment operation.
If you want a copy, make sure you force pandas to do just that.
This will save time and make your code water-tight.
Chained assignment in depth
Let’s reuse our earlier example where we were trying to update the bidderrate column for each row in data with a bidder value of 'parakeet2004'.
data[data.bidder == 'parakeet2004']['bidderrate'] = 100
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
What pandas is really telling us with this SettingWithCopyWarning is that the behavior of our code is ambiguous, but to understand why this is and the wording of the warning, it will be helpful to go over a few concepts.
We talked briefly about views and copies earlier. There are two possible ways to access a subset of a DataFrame: either one could create a reference to the original data in memory (a view) or copy the subset into a new, smaller DataFrame (a copy). A view is a way of looking at a particular portion of the original data, whereas a copy is a clone of that data to a new location in memory. As our diagram earlier showed, modifying a view will modify the original variable but modifying a copy will not.
For reasons that we will get into later, the output of ‘get’ operations in pandas is not guaranteed. Either a view or a copy could be returned when you index a pandas data structure, which means get operations on a DataFrame return a new DataFrame that can contain either:
- A copy of data from the original object.
- A reference to the original object’s data without making a copy.
Because we don’t know what will happen and each possibility has very different behavior, ignoring the warning is playing with fire.
To illustrate views, copies and this ambiguity more clearly, let’s create a simple DataFrame and index it:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), columns=list('AB'))
df1
index | A | B |
---|---|---|
0 | 0 | 1 |
1 | 2 | 3 |
2 | 4 | 5 |
And let’s assign a subset of df1 to df2:
df2 = df1.loc[:1]
df2
index | A | B |
---|---|---|
0 | 0 | 1 |
1 | 2 | 3 |
Given what we have learned, we know that df2 could be a view on df1 or a copy of a subset of df1.
Before we can get to grips with our problem, we also need to take another look at chained indexing. Expanding on our example with 'parakeet2004', we have chained together two indexing operations:
data[data.bidder == 'parakeet2004']
__intermediate__['bidderrate'] = 100
Where __intermediate__ represents the output of the first call and is completely hidden from us. Remember that we would get the same problematic outcome if we had used attribute access:
data[data.bidder == 'parakeet2004'].bidderrate = 100
The same applies to any other form of chained call because we are generating this intermediate object.
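The hidden intermediate object can be made visible by naming it. The sketch below does this on a made-up two-row frame (the data is an assumption for illustration). Warnings are silenced because whether and how pandas warns here depends on the installed version; what matters is which object ends up modified.

```python
import warnings

import pandas as pd

# Hypothetical toy data
data = pd.DataFrame({
    "bidder": ["parakeet2004", "other"],
    "bidderrate": [5, 12],
})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # warning behavior varies across pandas versions
    # Step 1: the get. A boolean mask selection produces a new object.
    intermediate = data[data["bidder"] == "parakeet2004"]
    # Step 2: the set. It lands on that new object, not on data.
    intermediate["bidderrate"] = 100

print(intermediate["bidderrate"].tolist())  # [100]
print(data["bidderrate"].tolist())          # [5, 12]
```

In the one-line chained form nothing holds a reference to the intermediate object, so the modified result is simply discarded.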
Under the hood, chained indexing means making more than one call to __getitem__ or __setitem__ to accomplish a single operation. These are special Python methods that are invoked by the use of square brackets on an instance of a class that implements them, an example of what is called syntactic sugar. Let’s look at what the Python interpreter will execute in our example.
# Our code
data[data.bidder == 'parakeet2004']['bidderrate'] = 100
# Code executed
data.__getitem__(data.__getitem__('bidder') == 'parakeet2004').__setitem__('bidderrate', 100)
As you may have realized already, SettingWithCopyWarning is generated as a result of this chained __setitem__ call. You can try this for yourself: the lines above function identically. For clarity, note that the second __getitem__ call (for the bidder column) is nested and not at all part of the chaining problem here.
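The translation from square brackets to __getitem__ and __setitem__ can be observed directly with a small logging class. Traced below is a hypothetical helper written for this illustration and has nothing to do with pandas itself; it only records which special method is called on which object.

```python
class Traced:
    """Hypothetical container that logs every subscript call made on it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a new object, just as a pandas get returns a new DataFrame
        return Traced(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key, value))


log = []
data = Traced("data", log)

# Chained assignment: one get followed by one set on the result
data["rows"]["bidderrate"] = 100

for entry in log:
    print(entry)
# ('data', '__getitem__', 'rows')
# ("data['rows']", '__setitem__', 'bidderrate', 100)
```

The log shows that the set is received by the object returned from the get, never by data itself.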
In general, as discussed, pandas does not guarantee whether a get operation will return a view or a copy of the data. If a view is returned in our example, the second expression in our chained assignment will be a call to __setitem__ on the original object. But if a copy is returned, it’s the copy that will be modified instead, and the original object does not get modified.
This is what the warning means by “a value is trying to be set on a copy of a slice from a DataFrame”. As there are no references to this copy, it will ultimately be garbage collected. The SettingWithCopyWarning is letting us know that pandas cannot determine whether a view or a copy was returned by the first __getitem__ call, and so it’s unclear whether the assignment changed the original object or not. Put another way, pandas gives us this warning because the answer to the question “are we modifying the original?” is unknown.
We do want to modify the original, and the solution that the warning suggests is to convert these two separate, chained operations into a single assignment operation using loc. This will remove chained indexing from our code and we will no longer receive the warning. Our fixed code and its expanded version will look like this:
# Our code
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100
# Code executed
data.loc.__setitem__((data.__getitem__('bidder') == 'parakeet2004', 'bidderrate'), 100)
Our DataFrame’s loc property is guaranteed to be the original DataFrame itself but with expanded indexing capabilities.
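As a quick check of the expansion above, the sketch below applies the bracket form and the explicit __setitem__ form to two copies of the same made-up frame (the data is an assumption for illustration) and compares the results.

```python
import pandas as pd

# Hypothetical toy data
base = pd.DataFrame({
    "bidder": ["parakeet2004", "other", "parakeet2004"],
    "bidderrate": [5, 12, 5],
})
mask = base["bidder"] == "parakeet2004"

# Bracket syntax on loc
a = base.copy()
a.loc[mask, "bidderrate"] = 100

# The same call written out: a single __setitem__ with a (rows, column) tuple
b = base.copy()
b.loc.__setitem__((mask, "bidderrate"), 100)

print(a.equals(b))                   # True
print(a["bidderrate"].tolist())      # [100, 12, 100]
```

Both forms make exactly one set call on the indexer of the frame we intend to change, which is why no intermediate object is involved.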
False negatives
Using loc doesn’t end our problems because get operations with loc can still return either a view or a copy. Let’s quickly examine a somewhat convoluted example.
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
index | bidderrate | bid |
---|---|---|
6 | 100 | 3.00 |
7 | 100 | 10.00 |
8 | 100 | 24.99 |
We’ve pulled two columns out this time rather than just the one. Let’s try to set all the bid values.
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]['bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
index | bidderrate | bid |
---|---|---|
6 | 100 | 3.00 |
7 | 100 | 10.00 |
8 | 100 | 24.99 |
No effect and no warning! We have set a value on a copy of a slice but it was not detected by pandas: this is a false negative. Just because we have used loc doesn’t mean we can start using chained assignment again. There is an old, unresolved issue on GitHub for this particular bug.
The correct way to do this is as follows:
data.loc[data.bidder == 'parakeet2004', 'bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]
index | bidderrate | bid |
---|---|---|
6 | 100 | 5.0 |
7 | 100 | 5.0 |
8 | 100 | 5.0 |
You might wonder how someone could possibly end up with such a problem in practice, but it’s easier than you might expect when assigning the results of DataFrame queries to variables as we do in the next section.
Hidden chaining
Let’s look again at our hidden chaining example from earlier, where we were trying to set the bidder value from the row labelled 304 in our winners variable.
winners = data.loc[data.bid == data.price]
winners.loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
We get another SettingWithCopyWarning even though we used loc. This problem can be incredibly confusing as the warning message appears to be suggesting that we do what we have already done.
But think about the winners variable. What really is it? Given that we instantiated it via data.loc[data.bid == data.price], we cannot know whether it’s a view or a copy of our original data DataFrame (because get operations return either a view or a copy). Combining the instantiation with the line that generated the warning makes clear our mistake.
data.loc[data.bid == data.price].loc[304, 'bidder'] = 'therealname'
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/indexing.py:517: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
We used chained assignment again, but this time it was broken across two lines. Another way to think about this is to ask the question “does this modify one or two things?” In our case, the answer is unknown: if winners is a copy then only winners is affected, but if it’s a view both winners and data will show updated values. This situation can occur between lines that are very far apart within a script or codebase, making the source of the problem potentially very difficult to track down.
The intention of the warning here is to prevent us from thinking our code will modify the original DataFrame when it won’t, or that we’re modifying a copy rather than the original. Delving into old issues on pandas’ GitHub repo, you can read the devs explaining this themselves.
If you prefer to work on a copy
How we resolve this problem depends very much on our own intentions. If we are happy to work with a copy of our original data, the solution is simply to force pandas to make a copy.
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(data.loc[304, 'bidder']) # Original
print(winners.loc[304, 'bidder']) # Copy
nan
therealname
If you prefer to operate on the original DataFrame: boolean indexing with a mask
If, on the other hand, you require that the original DataFrame is updated then you should work with the original DataFrame instead of instantiating other variables with unknown behavior. Our prior code would become:
# Finding the winners
winner_mask = data.bid == data.price
# Taking a peek
data.loc[winner_mask].head()
# Doing analysis
mean_win_time = data.loc[winner_mask, 'bidtime'].mean()
... # 20 lines of code
mode_open_bid = data.loc[winner_mask, 'openbid'].mode()
# Updating the username
data.loc[304, 'bidder'] = 'therealname'
In more complex circumstances, such as modifying a subset of a subset of a DataFrame, instead of using chained indexing one can modify the slices one is making via loc on the original DataFrame. For example, you could change our new winner_mask variable above or create a new variable that selected a subset of winners, like so:
high_winner_mask = winner_mask & (data.price > 150)
data.loc[high_winner_mask].head()
index | auctionid | bid | bidtime | bidder | bidderrate | openbid | price | bidtime_hours |
---|---|---|---|---|---|---|---|---|
225 | 8213387444 | 152.0 | 2.919757 | uconnbabydoll1975 | 15 | 0.99 | 152.0 | 70.074168 |
328 | 8213935134 | 207.5 | 2.983542 | toby2492 | 0 | 0.10 | 207.5 | 71.605008 |
416 | 8214430396 | 199.0 | 2.990463 | volpendesta | 4 | 9.99 | 199.0 | 71.771112 |
531 | 8215582227 | 152.5 | 2.999664 | ultimatum_man | 2 | 60.00 | 152.5 | 71.991936 |
This technique is more robust to future codebase maintenance and scaling.
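The mask-based approach can be sketched end to end on a small made-up frame (the values and the "high_winner" label are assumptions for illustration). Masks are combined first, and the only set operation is a single loc call on the original DataFrame.

```python
import pandas as pd

# Hypothetical toy data
data = pd.DataFrame({
    "bid": [100.0, 152.0, 90.0, 207.5],
    "price": [120.0, 152.0, 90.0, 207.5],
    "bidder": ["a", "b", "c", "d"],
})

# Boolean masks describe subsets without creating new DataFrames
winner_mask = data["bid"] == data["price"]
high_winner_mask = winner_mask & (data["price"] > 150)

# A subset of a subset, still updated with one loc call on the original
data.loc[high_winner_mask, "bidder"] = "high_winner"

print(data["bidder"].tolist())  # ['a', 'high_winner', 'c', 'high_winner']
```

Because the masks are plain boolean Series, they can be stored, reused and refined without ever raising the question of view versus copy.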
History
You might be wondering why the whole SettingWithCopy problem can’t simply be avoided entirely by explicitly specifying indexing methods that return either a view or a copy rather than creating the confusing situation we find ourselves in. To understand this, we must look into pandas’ past.
The logic pandas uses to determine whether it returns a view or a copy stems from its use of the NumPy library, which underlies pandas’ operation. Views actually entered the pandas lexicon via NumPy. Indeed, views are useful in NumPy because they are returned predictably. Because NumPy arrays are single-typed, pandas attempts to minimize space and processing requirements by using the most appropriate dtype. As a result, slices of a DataFrame that contain a single dtype can be returned as a view on a single NumPy array, which is a highly efficient way to handle the operation. However, multi-dtype slices can’t be stored in the same way in NumPy so efficiently. Pandas juggles versatile indexing functionality with the ability to use its NumPy core most effectively.
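The predictability of views in NumPy can be seen in a few lines. This is a minimal sketch using plain NumPy arrays, not pandas objects: basic slicing returns a view, while indexing with a list of integers returns a copy.

```python
import numpy as np

arr = np.arange(6).reshape(3, 2)

view = arr[:2]       # basic slicing: a view that shares memory with arr
copy = arr[[0, 1]]   # integer-list indexing: a copy with its own memory

view[0, 0] = 99      # writes through to arr
copy[0, 1] = -1      # does not touch arr

print(arr[0, 0])     # 99
print(arr[0, 1])     # 1
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))  # True False
```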
Ultimately, indexing in pandas was designed to be useful and versatile in a way that doesn’t exactly marry the functionality of the underlying NumPy arrays at its core. The interaction between these elements of design and function over time has led to a complex set of rules that determine whether or not a view or a copy can be returned. Experienced pandas developers are generally happy with pandas’ behaviors because they are comfortable navigating its indexing behaviors.
Unfortunately for newcomers to the library, chained indexing is almost unavoidable despite not being the intended approach simply because get operations return indexable pandas objects. Furthermore, in the words of Jeff Reback, one of the core developers of pandas for several years, “It’s simply not possible from a language perspective to detect chain indexing directly; it has to be inferred”.
Consequently, the warning was introduced in version 0.13.0 near the end of 2013 as a solution to the silent failure of chained assignment encountered by many developers.
Prior to version 0.12, the ix indexer was the most popular (in the pandas nomenclature, “indexers” such as ix, loc and iloc are simply constructs that allow objects to be indexed with square brackets just like arrays, but with special behavior). But it was around this time, in mid-2013, that the pandas project was beginning to gain momentum and catering to novice users was of rising importance. Since this release the loc and iloc indexers have consequently been preferred for their more explicit nature and easier to interpret usages.
[Figure: Google Trends interest in pandas]
The SettingWithCopyWarning has continued to evolve after its introduction, was hotly discussed in many GitHub issues for several years, and is even still being updated, but it’s here to stay and understanding it remains crucial to becoming a pandas expert.
Wrapping up
The complexity underlying the SettingWithCopyWarning is one of the few rough edges in the pandas library. Its roots are very deeply embedded in the library and should not be ignored. In Jeff Reback’s own words there “are no cases that I am aware [of] that you should actually ignore this warning. … If you do certain types of indexing it will never work, others it will work. You are really playing with fire.”
Fortunately, addressing the warning only requires you to identify chained assignment and fix it. If there’s just one thing to take away from all this, it’s that.