I have a dataframe and I'd like to do a groupby() based on a column, and then sort the values within each group by a date column. Then, from each group, I'd like to remove records whose column_condition == 'B' until I reach a row whose column_condition == 'A'. For example, assume the table below is one of the groups:
ID, DATE, column_condition
--------------------------
1, jan 2017, B
1, Feb 2017, B
1, Mar 2017, B
1, Aug 2017, A
1, Sept 2017, B
So, I'd like to remove the first three rows and leave this group with only the last two rows. How can I do that?
Solution
I think I finally understand your question: you wish to group a dataframe by 'ID', sort by date, and keep only the rows from the first occurrence of 'A' in your condition column onwards. I've come up with the following one-liner solution:
Setting up dummy data
import pandas as pd
import datetime as dt
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
],
'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
# 'DATE' but with list comprehension:
# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]
df = pd.DataFrame(d)
Interpreter
>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...
... .groupby('ID') # 'DATE' is not datetime already
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
1 3 1 2018-10-17 A
4 1 2018-11-16 B
2 8 2 2018-10-17 A
9 2 2018-11-16 B
You can also call reset_index(drop=True), if you need something like this:
ID DATE condition
0 1 2018-10-17 A
1 1 2018-11-16 B
2 2 2018-10-17 A
3 2 2018-11-16 B
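For reference, this is the full chain that produces the flat output above, using the same dummy data as before. reset_index(drop=True) discards the (ID, original index) MultiIndex that groupby().apply() creates:

```python
import datetime as dt
import pandas as pd

d = {
    'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'DATE': [
        dt.date(2018, 7, 19), dt.date(2018, 8, 18),
        dt.date(2018, 9, 17), dt.date(2018, 10, 17),
        dt.date(2018, 11, 16), dt.date(2018, 7, 19),
        dt.date(2018, 8, 18), dt.date(2018, 9, 17),
        dt.date(2018, 10, 17), dt.date(2018, 11, 16)
    ],
    'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']
}
df = pd.DataFrame(d)

result = (df.sort_values(by='DATE')
            .groupby('ID')
            .apply(lambda x: x[(x['condition'].values == 'A').argmax():])
            .reset_index(drop=True))  # flatten the MultiIndex into 0, 1, 2, ...
```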
(x['condition'].values == 'A') returns a bool np.array, and calling argmax() gives us the index where the first occurrence of True happens (where condition == 'A' in this case). Using that index, we're subsetting each of the groups with a slice.
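A minimal illustration of the argmax() trick, which also shows the caveat that motivates the edit below: argmax() returns 0 both when the first element is True and when no element is True at all, so a group with no 'A' would be kept in full.

```python
import numpy as np

cond = np.array(['B', 'B', 'A', 'B'])
mask = cond == 'A'          # array([False, False,  True, False])
first_a = mask.argmax()     # index of the first True
assert first_a == 2

# Caveat: argmax() on an all-False array also returns 0,
# so slicing with it would keep every row of such a group.
assert (np.array(['B', 'B']) == 'A').argmax() == 0
```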
EDIT: Added a filter for dealing with groups that only contain the undesired condition (without it, argmax() returns 0 for those groups and the slice would keep all of their rows).
d = {
'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity
'DATE': [ # Dates already sorted, but it would work anyways
dt.date(2018, 7, 19), dt.date(2018, 8, 18),
dt.date(2018, 9, 17), dt.date(2018, 10, 17),
dt.date(2018, 11, 16), dt.date(2018, 7, 19),
dt.date(2018, 8, 18), dt.date(2018, 9, 17),
dt.date(2018, 10, 17), dt.date(2018, 11, 16)
],
'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B'] # ID 1 only contains 'B'
}
df = pd.DataFrame(d)
Interpreter
>>> df
ID DATE condition
0 1 2018-07-19 B
1 1 2018-08-18 B
2 1 2018-09-17 B
3 1 2018-10-17 B
4 1 2018-11-16 B
5 2 2018-07-19 B
6 2 2018-08-18 B
7 2 2018-09-17 B
8 2 2018-10-17 A
9 2 2018-11-16 B
>>> (df.sort_values(by='DATE')
... .groupby('ID')
... .filter(lambda x: (x['condition'] == 'A').any())
... .groupby('ID')
... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))
ID DATE condition
ID
2 8 2 2018-10-17 A
9 2 2018-11-16 B
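As an alternative sketch (not part of the original one-liner), you can avoid both groupby().apply() and the separate filter step with a per-group cumulative sum over the boolean mask: within each ID, rows become True from the first 'A' onwards, and groups with no 'A' stay all False and are dropped automatically.

```python
import pandas as pd

df = pd.DataFrame({
    'ID':        [1, 1, 1, 2, 2, 2],
    'DATE':      pd.to_datetime(['2018-07-19', '2018-08-18', '2018-09-17',
                                 '2018-07-19', '2018-08-18', '2018-09-17']),
    'condition': ['B', 'A', 'B', 'B', 'B', 'B'],  # ID 2 has no 'A' at all
})

df = df.sort_values(['ID', 'DATE'])
# Running count of 'A's seen so far within each ID; > 0 from the first 'A' onwards
keep = (df['condition'] == 'A').groupby(df['ID']).cumsum() > 0
result = df[keep]
```

Here result keeps only ID 1's rows from its first 'A' onwards, and ID 2 disappears entirely, so no extra .filter() pass is needed.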