Translated from: How to deal with SettingWithCopyWarning in Pandas?
Background
I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now the application is emitting many new warnings. One of them looks like this:
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
What exactly does this mean? Do I need to change something?
How can I suppress the warning if I insist on using quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?
The function that raises the warnings
def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""
    from cStringIO import StringIO
    str_of_all = "".join(list_of_150_stk_str)
    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT'] = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
    return quote_df
More error messages
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
Reply #1
Reference: https://stackoom.com/question/1OXeg/如何处理Pandas中的SettingWithCopyWarning
Reply #2
The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments such as the following, which do not always work as expected, particularly when the first selection returns a copy. [See GH5390 and GH5597 for background discussion.]
df[df['A'] > 2]['B'] = new_val # new_val not set in df
The warning offers a suggestion to rewrite as follows:
df.loc[df['A'] > 2, 'B'] = new_val
However, this doesn't fit your usage, which is equivalent to:
df = df[df['A'] > 2]
df['B'] = new_val
While it's clear that you don't care about writes making it back to the original frame (since you overwrote the reference to it), this pattern unfortunately cannot be differentiated from the first chained assignment example, hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.
pd.options.mode.chained_assignment = None # default='warn'
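To make the three patterns above concrete, here is a minimal sketch on a made-up four-row frame (the data and the names `chained_result`, `loc_result` and `subset` are illustrative assumptions, not part of the question). It silences warnings only inside the demo block rather than globally, and it adds an explicit `.copy()` to the "rebind the name" pattern, which is one way to state the intent without touching the global option.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
new_val = 0

# Chained assignment: df[mask] builds a new object, so the write is lost.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo only
    df[df["A"] > 2]["B"] = new_val
chained_result = df["B"].tolist()  # df itself was not modified

# Single .loc call: selection and assignment both happen on df.
df.loc[df["A"] > 2, "B"] = new_val
loc_result = df["B"].tolist()

# The pattern from the question: keep only the subset, then modify it.
# The explicit .copy() states that writes are not meant to reach df.
subset = df[df["A"] > 2].copy()
subset["B"] = 99
```

Whether to disable the warning globally or to copy explicitly is a judgment call; the explicit copy costs some memory but keeps the warning available for genuine mistakes elsewhere.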
Reply #3
In general, the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not on the original, as they think. There are false positives (IOW, if you know what you are doing, it could be OK). One possibility is simply to turn off the warning (which defaults to warn), as @Garrett suggests.
Here is another option:
In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [2]: dfa = df.ix[:, [1, 0]]
In [3]: dfa.is_copy
Out[3]: True
In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
#!/usr/local/bin/python
You can set the is_copy flag to False, which will effectively turn off the check for that object:
In [5]: dfa.is_copy = False
In [6]: dfa['A'] /= 2
If you explicitly copy, then no further warning will happen:
In [7]: dfa = df.ix[:, [1, 0]].copy()
In [8]: dfa['A'] /= 2
The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to avoid the warning would be to do the selection operation via reindex, e.g.
quote_df = quote_df.reindex(columns=['STK', ...])
Or,
quote_df = quote_df.reindex(['STK', ...], axis=1) # v.0.21
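The two warning-free routes in this reply can be sketched on a tiny frame as follows. Because `.ix` has been removed from current pandas, the sketch substitutes `.iloc` for the positional selection; that substitution, the sample data, and the name `dfb` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

# Route 1: explicit copy of a positional column selection (columns B, A).
dfa = df.iloc[:, [1, 0]].copy()
dfa["A"] = dfa["A"] / 2

# Route 2: select and reorder columns through reindex.
dfb = df.reindex(columns=["B", "A"])
dfb["A"] = dfb["A"] / 2
```

In both cases the original `df` keeps its values, which is the behavior the question's code relies on.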
Reply #4
Pandas dataframe copy warning
When you go and do something like this:
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
.ix in this case returns a new, standalone dataframe.
Any values you decide to change in this dataframe will not change the original dataframe.
This is what pandas tries to warn you about.
Why .ix is a bad idea
The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.
Given this dataframe:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})
Two behaviors:
dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2
Behavior one: dfcopy is now a standalone dataframe. Changing it will not change df.
df.ix[0, "a"] = 3
Behavior two: this changes the original dataframe.
Use .loc instead
The pandas developers recognized that the .ix object was quite smelly [speculatively] and thus created two new objects that help with accessing and assigning data: .loc and .iloc.
.loc is faster, because it does not try to create a copy of the data.
.loc is meant to modify your existing dataframe in place, which is more memory efficient.
.loc is predictable; it has one behavior.
The solution
What you are doing in your code example is loading a big file with lots of columns, then trimming it down to a smaller one.
The pd.read_csv function can do much of this for you and also make loading the file a lot faster.
So instead of doing this:
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
Do this:
columns = ['STK', 'TOpen', 'TPCLOSE', 'TPrice', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
# header=None: the data has no header row. usecols ignores list order, so the names above follow file order.
df = pd.read_csv(StringIO(str_of_all), sep=',', header=None, usecols=[0,1,2,3,4,5,8,9,30,31])
df.columns = columns
This reads only the columns you are interested in and names them properly. There is no need to use the evil .ix object to do magical stuff.
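One detail worth checking with this approach: `usecols` selects columns but does not reorder them, so the names assigned afterwards have to follow file order. The sketch below uses two made-up rows (the `raw` string is an assumption, not the poster's data) and reorders with `reindex` at the end in case the original column order matters.

```python
from io import StringIO
import pandas as pd

raw = "s1,10,11,12,x\ns2,20,21,22,y\n"  # made-up rows, no header line

# Ask for columns 0, 3 and 2; the result comes back in file order (0, 2, 3).
df = pd.read_csv(StringIO(raw), sep=",", header=None, usecols=[0, 3, 2])
df.columns = ["STK", "TPCLOSE", "TPrice"]  # names listed in file order

# Reorder afterwards if the original column order matters.
df = df.reindex(columns=["STK", "TPrice", "TPCLOSE"])
```

`header=None` matters here because the data has no header row; without it the first record would be consumed as column names.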
Reply #5
If you have assigned the slice to a variable and want to set values through that variable, as in the following:
df2 = df[df['A'] > 2]
df2['B'] = value
And you do not want to use Jeff's solution because the condition that computes df2 is too long, or for some other reason, then you can use the following:
df.loc[df2.index.tolist(), 'B'] = value
df2.index.tolist() returns the index labels of all entries in df2, which are then used to set column B in the original dataframe.
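A small end-to-end sketch of this idea, with made-up data and `value = 7` as an arbitrary choice. It assumes the index labels of `df` are unique; with duplicate labels, `.loc` would write to every row sharing a selected label.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0, 0, 0, 0]})
value = 7

df2 = df[df["A"] > 2]                     # the slice; only read from here
df.loc[df2.index.tolist(), "B"] = value   # write to the original by label
```

Passing `df2.index` directly instead of `df2.index.tolist()` should behave the same way for this case.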
Reply #6
To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy. This may not be applicable depending on your context (memory constraints, size of the slice, potential for performance degradation, especially if the copy occurs in a loop as it did for me, etc.).
To be clear, here is the warning I received:
/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Illustration
I suspected that the warning was raised because of a column I was dropping on a copy of the slice. While that is not technically setting a value in the copy of the slice, it is still a modification of the copy of the slice. Below are the (simplified) steps I took to confirm the suspicion. I hope they help those of us who are trying to understand the warning.
Example 1: dropping a column on the original affects the copy
We knew that already, but this is a healthy reminder. This is NOT what the warning is about.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
B
0 121
1 122
2 123
It is possible to prevent changes made to df1 from affecting df2:
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
A B
0 111 121
1 112 122
2 113 123
Example 2: dropping a column on the copy may affect the original
This actually illustrates the warning.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1
B
0 121
1 122
2 123
It is possible to prevent changes made to df2 from affecting df1:
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
>> df2.drop('A', axis=1, inplace=True)
>> df1
A B
0 111 121
1 112 122
2 113 123
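For what it's worth, the examples above can be condensed into a few lines. The sketch suggests that `df2 = df1` does not create a copy at all; it only binds a second name to the same object, which would explain why no slice is needed to see the effect. It also uses `DataFrame.copy()`, whose default is a deep copy, in place of `copy.deepcopy`. The names `alias` and `independent` are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [111, 112, 113], "B": [121, 122, 123]})

alias = df1               # a second name for the same object, not a copy
independent = df1.copy()  # DataFrame.copy() is deep by default

# Dropping through the alias changes df1, because they are the same object.
alias.drop("A", axis=1, inplace=True)
```

After the drop, `df1` has lost column A while `independent` still has both columns.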
Cheers!