I have a dataframe which contains duplicate values according to two columns (A and B):
A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8
I want to remove the duplicates, keeping the row with the maximum value in column C. This would lead to:
A B C
1 2 4
2 7 1
3 4 8
I cannot figure out how to do that. Should I use drop_duplicates(), or something else?
Solution
You can do it using groupby:
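# Maximum of C within each (A, B) group, broadcast back to every row of df
c_maxes = df.groupby(['A', 'B']).C.transform('max')
# Keep only the rows whose C equals its group's maximum
df = df.loc[df.C == c_maxes]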
c_maxes is a Series holding the maximum value of C for each group, but with the same length and the same index as df. If you haven't used .transform before, printing c_maxes is a good way to see how it works.
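To see the groupby/transform idea end to end, here is a minimal sketch run on the sample data from the question. The name result is only illustrative, and the string 'max' is used in place of the built-in max function, which recent pandas versions prefer:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 1, 2, 3, 3],
    "B": [2, 2, 7, 4, 4],
    "C": [1, 4, 1, 0, 8],
})

# One value per row of df: the maximum of C within that row's (A, B) group
c_maxes = df.groupby(["A", "B"])["C"].transform("max")
print(c_maxes.tolist())  # [4, 4, 1, 8, 8]

# Boolean mask keeps only the rows where C equals its group's maximum
result = df.loc[df["C"] == c_maxes]
print(result)
```

Note that if two rows in the same group tie for the maximum of C, this filter keeps both of them, so it only removes every duplicate when the maximum is unique within each group.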
Another approach using drop_duplicates would be
df.sort
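A minimal sketch of what a sort-based variant could look like, assuming a recent pandas version where sort_values is the sorting method. The name deduped and the final sort_index() call are illustrative additions, not taken from the answer above:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 1, 2, 3, 3],
    "B": [2, 2, 7, 4, 4],
    "C": [1, 4, 1, 0, 8],
})

deduped = (
    df.sort_values("C")                               # ascending, so each group's max C comes last
      .drop_duplicates(subset=["A", "B"], keep="last")  # keep the last (largest C) row per (A, B)
      .sort_index()                                   # restore the original row order
)
print(deduped)
```

Unlike the boolean-mask approach, this always leaves exactly one row per (A, B) pair, even when several rows tie for the maximum of C.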