Comparing df.groupby().first() and df.drop_duplicates() for deduplication

babyjustsaidyes

Category: Pandas. Tags: python

Article link: https://blog.csdn.net/weixin_43256057/article/details/121429479


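This article covers two ways to deduplicate data with the Python Pandas library, groupby and drop_duplicates, compares their results on an example, and stresses that row order matters when comparing those results.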


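# `data` is the author's DataFrame (not shown here); it contains a `recommend` column.
dr = data[['recommend']]
drF = dr.groupby(dr.recommend).first().reset_index()  # 162 rows
dF = dr.drop_duplicates()
drF.values == dF.values  # element-wise, position-by-position comparison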

out:
array([[False],
       [False],
       [False],
       …
       [False],
       [False]])

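However, an array-comparison routine I wrote myself showed that drF and dF contain exactly the same recommend values. The likely cause is row order: groupby sorts the group keys by default (sort=True), while drop_duplicates keeps rows in their original order, so the rigid position-by-position comparison drF.values==dF.values does not line up. Sort both sides first: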

drF.sort_values(by='recommend').values==dF.sort_values(by='recommend').values

out:
array([[ True],
       [ True],
       [ True],
       …
       [ True],
       [ True]])

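Every entry is True, so the two arrays hold the same values, which confirms the guess above. For a single-column case like this one, either method can be used for deduplication, but dr.drop_duplicates() is the better choice because the code is shorter. The two are not interchangeable in general: by default groupby drops rows whose key is NaN, and first() returns the first non-null value of each column rather than the first row.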
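The comparison above can be reproduced on a small frame. This is a minimal sketch, not the author's data: the six-row `data` frame below is a made-up stand-in, and the groupby key is passed as the column name `"recommend"` rather than as `dr.recommend`. It shows the two row orders side by side, reduces the element-wise comparison to a single boolean with `.all()`, and uses `sort=False` to make groupby keep the order of first appearance.

```python
import pandas as pd

# Hypothetical stand-in for the author's `data`; only `recommend` matters here.
data = pd.DataFrame({"recommend": ["c", "a", "b", "a", "c", "b"]})
dr = data[["recommend"]]

# groupby sorts the group keys by default (sort=True).
drF = dr.groupby("recommend").first().reset_index()
# drop_duplicates keeps the first occurrence of each value, in original order.
dF = dr.drop_duplicates()

print(drF["recommend"].tolist())  # ['a', 'b', 'c']
print(dF["recommend"].tolist())   # ['c', 'a', 'b']

# Position-by-position comparison fails because the row order differs.
same_positionwise = bool((drF.values == dF.values).all())

# After sorting both sides, the values line up.
same_after_sort = bool(
    (drF.sort_values(by="recommend").values
     == dF.sort_values(by="recommend").values).all()
)

# sort=False keeps groups in order of first appearance, matching drop_duplicates.
drF_unsorted = dr.groupby("recommend", sort=False).first().reset_index()

print(same_positionwise, same_after_sort)   # False True
print(drF_unsorted["recommend"].tolist())   # ['c', 'a', 'b']
```

Reducing with `.all()` avoids scanning a long printed array by eye, which matters once the output is elided as it is above.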

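Addendum (same result as above, except that it explicitly deduplicates on the recommend column, which is useful when the frame has other columns):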

dr.drop_duplicates(subset='recommend')
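Once the frame has more than one column, `subset` starts to matter, and the two approaches can return different values. The sketch below assumes an invented second column, `score`, that is not part of the original post. It illustrates that `drop_duplicates(subset=...)` keeps the first row for each key as it is, while `groupby(...).first()` picks the first non-null value in each column. `groupby(...).head(1)` is included as a row-wise groupby counterpart.

```python
import numpy as np
import pandas as pd

# Hypothetical data: `score` is an invented column used only for illustration.
df = pd.DataFrame({
    "recommend": ["a", "a", "b", "b"],
    "score": [np.nan, 1.0, 2.0, 3.0],
})

# Keeps the first *row* per recommend value, NaN included.
by_drop = df.drop_duplicates(subset="recommend")

# Takes the first *non-null* value of each column within each group.
by_first = df.groupby("recommend", as_index=False).first()

# head(1) returns the first row of each group with the original index,
# so it matches drop_duplicates(subset=...) here.
by_head = df.groupby("recommend").head(1)

print(by_drop)
print(by_first)
print(by_head.equals(by_drop))
```

For key `a`, `by_drop` keeps the NaN from the first row, whereas `by_first` reports 1.0 taken from the second row. If the goal is to keep whole rows, `drop_duplicates(subset=...)` is the safer of the two.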
