python df删除特定行,如何在python中删除组中的某些行

I'm having a dataframe and I'd like to do a groupby() based a column and then sort the values within each group based on a date column. Then, from a each I'd like to remove records whose value for column_condition == 'B' until I reach to a row whose column_condition == 'A'. For example, Assume the table below is one of the groups

ID, DATE, column_condition

--------------------------

1, jan 2017, B

1, Feb 2017, B

1, Mar 2017, B

1, Aug 2017, A

1, Sept 2017, B

So, I'd like to remove the first three rows and leave this group with only the last two rows. How can I do that?

解决方案

I think I finally understand your question: you wish to groupby a dataframe by 'ID', sort by date, and keep the rows after the first ocurrence of 'A' in your condition column. I've come up with the following one liner solution:

Setting up dummy data

import pandas as pd

import datetime as dt

d = {

'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity

'DATE': [ # Dates already sorted, but it would work anyways

dt.date(2018, 7, 19), dt.date(2018, 8, 18),

dt.date(2018, 9, 17), dt.date(2018, 10, 17),

dt.date(2018, 11, 16), dt.date(2018, 7, 19),

dt.date(2018, 8, 18), dt.date(2018, 9, 17),

dt.date(2018, 10, 17), dt.date(2018, 11, 16)

],

'condition': ['B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B']

}

# 'DATE' but with list comprehension:

# [dt.date.today() + dt.timedelta(days=30*x) for y in range(0, 2) for x in range(0, 5)]

df = pd.DataFrame(d)

Interpreter

>>> (df.sort_values(by='DATE') # we should call pd.to_datetime() first if...

... .groupby('ID') # 'DATE' is not datetime already

... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

ID DATE condition

ID

1 3 1 2018-10-17 A

4 1 2018-11-16 B

2 8 2 2018-10-17 A

9 2 2018-11-16 B

You can also call reset_index(drop=True), if you need something like this:

ID DATE condition

0 1 2018-10-17 A

1 1 2018-11-16 B

2 2 2018-10-17 A

3 2 2018-11-16 B

(x['condition'].values == 'A') returns a bool np.array, and calling argmax() gives us then index where the first ocurrence of True happens (where condition == 'A' in this case). Using that index, we're subsetting each of the groups with a slice.

EDIT: Added filter for dealing with groups that only contain the undesired condition.

d = {

'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], # Assuming only two unique IDs for simplicity

'DATE': [ # Dates already sorted, but it would work anyways

dt.date(2018, 7, 19), dt.date(2018, 8, 18),

dt.date(2018, 9, 17), dt.date(2018, 10, 17),

dt.date(2018, 11, 16), dt.date(2018, 7, 19),

dt.date(2018, 8, 18), dt.date(2018, 9, 17),

dt.date(2018, 10, 17), dt.date(2018, 11, 16)

], # ID 1 only contains 'B'

'condition': ['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B']

}

df = pd.DataFrame(d)

Interpreter

>>> df

ID DATE condition

0 1 2018-07-19 B

1 1 2018-08-18 B

2 1 2018-09-17 B

3 1 2018-10-17 B

4 1 2018-11-16 B

5 2 2018-07-19 B

6 2 2018-08-18 B

7 2 2018-09-17 B

8 2 2018-10-17 A

9 2 2018-11-16 B

>>> (df.sort_values(by='DATE')

... .groupby('ID')

... .filter(lambda x: (x['condition'] == 'A').any())

... .groupby('ID')

... .apply(lambda x: x[(x['condition'].values == 'A').argmax():]))

ID DATE condition

ID

2 8 2 2018-10-17 A

9 2 2018-11-16 B

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值