python删除第一行数据_Python：Pandas-按组删除第一行

weixin_40006763

于 2020-12-06 07:09:48 发布

阅读量894

点赞数

文章标签： python删除第一行数据

I have the following large dataframe (df) that looks like this:

ID date PRICE

1 10001 19920103 14.500

2 10001 19920106 14.500

3 10001 19920107 14.500

4 10002 19920108 15.125

5 10002 19920109 14.500

6 10002 19920110 14.500

7 10003 19920113 14.500

8 10003 19920114 14.500

9 10003 19920115 15.000

Question: What's the most efficient way to delete (or remove) the first row of each ID? I want this:

ID date PRICE

2 10001 19920106 14.500

3 10001 19920107 14.500

5 10002 19920109 14.500

6 10002 19920110 14.500

8 10003 19920114 14.500

9 10003 19920115 15.000

I can do a loop over each unique ID and remove the first row but I believe this is not very efficient.

解决方案

You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:

import numpy as np

import pandas as pd

df = pd.DataFrame(

{'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],

'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],

'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,

19920113, 19920114, 19920115]},

index = range(1,10))

def mask_first(x):

result = np.ones_like(x)

result[0] = 0

return result

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)

print(df.loc[mask])

yields

ID PRICE date

2 10001 14.5 19920106

3 10001 14.5 19920107

5 10002 14.5 19920109

6 10002 14.5 19920110

8 10003 14.5 19920114

9 10003 15.0 19920115

Since you're interested in efficiency, here is a benchmark:

import timeit

import operator

import numpy as np

import pandas as pd

N = 10000

df = pd.DataFrame(

{'ID': np.random.randint(100, size=(N,)),

'PRICE': np.random.random(N),

'date': np.random.random(N)})

def using_mask(df):

def mask_first(x):

result = np.ones_like(x)

result[0] = 0

return result

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)

return df.loc[mask]

def using_apply(df):

return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])

def using_apply_alt(df):

return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])

timing = dict()

for func in (using_mask, using_apply, using_apply_alt):

timing[func] = timeit.timeit(

'{}(df)'.format(func.__name__),

'from __main__ import df, {}'.format(func.__name__), number=100)

for func, t in sorted(timing.items(), key=operator.itemgetter(1)):

print('{:16}: {:.2f}'.format(func.__name__, t))

reports

using_mask : 0.85

using_apply_alt : 2.04

using_apply : 3.70

weixin_40006763

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python删除第一行数据_Python：Pandas-按组删除第一行

I have the following large dataframe (df) that looks like this:ID date PRICE1 10001 19920103 14.5002 10001 19920106 14.5003 10001 19920107 14.5004 10002 19920108 15.1255 1...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。