I recently start working with pandas. Can anyone explain me difference in behaviour of function .corrwith() with Series and DataFrame?
Suppose i have one DataFrame:
frame = pd.DataFrame(data={'a':[1,2,3], 'b':[-1,-2,-3], 'c':[10, -10, 10]})
And i want calculate correlation between features 'a' and all other features.
I can do it in the following way:
frame.drop(labels='a', axis=1).corrwith(frame['a'])
And result will be:
b -1.0
c 0.0
But very similar code:
frame.drop(labels='a', axis=1).corrwith(frame[['a']])
Generate absolutely different and unacceptable table:
a NaN
b NaN
c NaN
So, my question is: why in case of DataFrame as second argument we get such strange output?
解决方案
What I think you're looking for:
Let's say your frame is:
frame = pd.DataFrame(np.random.rand(10, 6), columns=['cost', 'amount', 'day', 'month', 'is_sale', 'hour'])
You want the 'cost' and 'amount' columns to be correlated with all other columns in every combination.
focus_cols = ['cost', 'amount']
frame.corr().filter(focus_cols).drop(focus_cols)
Answering what you asked:
Compute pairwise
correlation between rows or columns of two DataFrame objects.
Parameters:
other : DataFrame
axis : {0 or ‘index’, 1 or ‘columns’},
default 0 0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise drop : boolean, default False Drop missing indices from
result, default returns union of all Returns: correls : Series
corrwith is behaving similarly to add, sub, mul, div in that it expects to find a DataFrame or a Series being passed in other despite the documentation saying just DataFrame.
When other is a Series it broadcast that series and matches along the axis specified by axis, default is 0. This is why the following worked:
frame.drop(labels='a', axis=1).corrwith(frame.a)
b -1.0
c 0.0
dtype: float64
When other is a DataFrame it will match the axis specified by axis and correlate each pair identified by the other axis. If we did:
frame.drop('a', axis=1).corrwith(frame.drop('b', axis=1))
a NaN
b NaN
c 1.0
dtype: float64
Only c was in common and only c had its correlation calculated.
In the case you specified:
frame.drop(labels='a', axis=1).corrwith(frame[['a']])
frame[['a']] is a DataFrame because of the [['a']] and now plays by the DataFrame rules in which its columns must match up with what its being correlated with. But you explicitly drop a from the first frame then correlate with a DataFrame with nothing but a. The result is NaN for every column.