python删除第一行数据_Python:Pandas-按组删除第一行

I have the following large dataframe (df) that looks like this:

ID date PRICE

1 10001 19920103 14.500

2 10001 19920106 14.500

3 10001 19920107 14.500

4 10002 19920108 15.125

5 10002 19920109 14.500

6 10002 19920110 14.500

7 10003 19920113 14.500

8 10003 19920114 14.500

9 10003 19920115 15.000

Question: What's the most efficient way to delete (or remove) the first row of each ID? I want this:

ID date PRICE

2 10001 19920106 14.500

3 10001 19920107 14.500

5 10002 19920109 14.500

6 10002 19920110 14.500

8 10003 19920114 14.500

9 10003 19920115 15.000

I can do a loop over each unique ID and remove the first row but I believe this is not very efficient.

解决方案

You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:

import numpy as np

import pandas as pd

df = pd.DataFrame(

{'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],

'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],

'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,

19920113, 19920114, 19920115]},

index = range(1,10))

def mask_first(x):

result = np.ones_like(x)

result[0] = 0

return result

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)

print(df.loc[mask])

yields

ID PRICE date

2 10001 14.5 19920106

3 10001 14.5 19920107

5 10002 14.5 19920109

6 10002 14.5 19920110

8 10003 14.5 19920114

9 10003 15.0 19920115

Since you're interested in efficiency, here is a benchmark:

import timeit

import operator

import numpy as np

import pandas as pd

N = 10000

df = pd.DataFrame(

{'ID': np.random.randint(100, size=(N,)),

'PRICE': np.random.random(N),

'date': np.random.random(N)})

def using_mask(df):

def mask_first(x):

result = np.ones_like(x)

result[0] = 0

return result

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)

return df.loc[mask]

def using_apply(df):

return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])

def using_apply_alt(df):

return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])

timing = dict()

for func in (using_mask, using_apply, using_apply_alt):

timing[func] = timeit.timeit(

'{}(df)'.format(func.__name__),

'from __main__ import df, {}'.format(func.__name__), number=100)

for func, t in sorted(timing.items(), key=operator.itemgetter(1)):

print('{:16}: {:.2f}'.format(func.__name__, t))

reports

using_mask : 0.85

using_apply_alt : 2.04

using_apply : 3.70

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值