1. Duplicates
1.1 Check Duplicates
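import pandas as pd

# Column names: 水果 = fruit, 数量 = quantity, 价格 = price
data = {'水果': ['苹果', '梨', '草莓', '草莓'],  # apple, pear, strawberry, strawberry
        '数量': [3, 2, 5, 5],
        '价格': [10, 9, 8, 10]}
df = pd.DataFrame(data)
df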
Select the rows that repeat an earlier row on both the 水果 (fruit) and 数量 (quantity) columns:
df[df.duplicated(subset=['水果', '数量'])]
Count the duplicated rows:
df.duplicated(subset=['水果', '数量']).sum()
output: 1
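By default, duplicated() uses keep='first', so the first row of each duplicate group is not flagged; that is why the count above is 1 even though two rows share the same 水果 and 数量. A minimal sketch, assuming you want to inspect every row involved in a duplicate group (the names all_dupes and n_involved are illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame({
    '水果': ['苹果', '梨', '草莓', '草莓'],
    '数量': [3, 2, 5, 5],
    '价格': [10, 9, 8, 10],
})

# keep=False flags every member of a duplicate group,
# including the first occurrence.
all_dupes = df[df.duplicated(subset=['水果', '数量'], keep=False)]
print(all_dupes)

# Number of rows that belong to some duplicate group.
n_involved = len(all_dupes)
print(n_involved)
```

Viewing both rows side by side makes it easier to see that they differ in 价格, which may matter when deciding which copy to keep.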
1.2 Deduplicate
deduplicated_df = df.drop_duplicates(subset=['水果', '数量'], keep='first')  # keep='first' keeps the first row of each duplicate group
deduplicated_df
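The keep parameter of drop_duplicates also accepts 'last' and False. The sketch below, which reuses the same sample data, shows how the result changes; the variable names keep_last and drop_all are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    '水果': ['苹果', '梨', '草莓', '草莓'],
    '数量': [3, 2, 5, 5],
    '价格': [10, 9, 8, 10],
})

# keep='last' keeps the last row of each duplicate group.
keep_last = df.drop_duplicates(subset=['水果', '数量'], keep='last')

# keep=False drops every row that belongs to a duplicate group.
drop_all = df.drop_duplicates(subset=['水果', '数量'], keep=False)

print(keep_last)
print(drop_all)
```

Because the two 草莓 rows carry different prices, the choice of keep determines which price survives, so it is worth choosing deliberately rather than relying on the default.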
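2. Overlap
Overlapping data are rows that appear in both of two DataFrames. For example, when preparing a machine-learning dataset, we usually need to check that the training, validation, and test sets do not overlap too much.
2.1 Check Overlap
This is done by computing the intersection of the two DataFrames: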
# Create two DataFrames
data1 = {
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4', 'semester1',
                'semester2', 'semester3'],
    'Score': [62, 47, 55, 74, 31, 77, 85]}
data2 = {
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4'],
    'Score': [90, 47, 85, 74]}
df1 = pd.DataFrame(data1, columns=['Subject', 'Score'])
df2 = pd.DataFrame(data2, columns=['Subject', 'Score'])
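df1:
     Subject  Score
0  semester1     62
1  semester2     47
2  semester3     55
3  semester4     74
4  semester1     31
5  semester2     77
6  semester3     85
df2:
     Subject  Score
0  semester1     90
1  semester2     47
2  semester3     85
3  semester4     74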
# Intersection of the two DataFrames (with no `on` argument, merge joins on all shared columns)
intersected_df = pd.merge(df1, df2, how='inner')
intersected_df
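To judge whether the overlap is "too much", it can help to turn the intersection into a ratio. The following is a rough sketch, not part of the original method: overlap_fraction is a hypothetical helper, and it assumes both DataFrames have the same columns. Rows are deduplicated first because an inner merge would otherwise repeat a match once per duplicate pair.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4',
                'semester1', 'semester2', 'semester3'],
    'Score': [62, 47, 55, 74, 31, 77, 85],
})
df2 = pd.DataFrame({
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4'],
    'Score': [90, 47, 85, 74],
})


def overlap_fraction(a, b):
    """Fraction of the distinct rows of `a` that also appear in `b`.

    Hypothetical helper; assumes `a` and `b` share the same columns.
    """
    a_unique = a.drop_duplicates()
    b_unique = b.drop_duplicates()
    if len(a_unique) == 0:
        return 0.0
    # Inner merge on all shared columns gives the common rows.
    common = pd.merge(a_unique, b_unique, how='inner')
    return len(common) / len(a_unique)


print(overlap_fraction(df2, df1))  # share of df2's rows also found in df1
print(overlap_fraction(df1, df2))  # share of df1's rows also found in df2
```

The ratio is asymmetric: it depends on which DataFrame is used as the denominator, so for a train/test check it is usually the smaller set (for example the test set) that goes first.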
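2.2 Drop Overlap
To remove the overlapping rows, e.g. to drop from df1 the rows that also appear in df2, compute the set difference df1 - intersected_df. Note that drop_duplicates(keep=False) also removes any rows that are duplicated within df1 itself:
intersected_df = pd.merge(df1, df2, how='inner')
drop_overlap_df = pd.concat([df1, intersected_df, intersected_df]).drop_duplicates(keep=False)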
drop_overlap_df
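An alternative that leaves df1's own repeated rows untouched is a left merge with indicator=True, keeping only the rows marked 'left_only' (an anti-join). This is a sketch of a swapped-in technique, not the approach used above; df1_with_dup is a hypothetical variant of df1 with one repeated row, added only to show the difference between the two approaches.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4',
                'semester1', 'semester2', 'semester3'],
    'Score': [62, 47, 55, 74, 31, 77, 85],
})
df2 = pd.DataFrame({
    'Subject': ['semester1', 'semester2', 'semester3', 'semester4'],
    'Score': [90, 47, 85, 74],
})

# Hypothetical variant: df1 plus a repeated copy of its first row.
df1_with_dup = pd.concat([df1, df1.iloc[[0]]], ignore_index=True)

# Anti-join: the left merge adds a '_merge' column saying where each row was found.
# Deduplicating df2 first keeps the merge from multiplying rows of the left side.
merged = df1_with_dup.merge(df2.drop_duplicates(), how='left', indicator=True)
anti_join_df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# The concat-based approach, applied to the same data for comparison.
intersected = pd.merge(df1_with_dup, df2, how='inner')
concat_df = pd.concat([df1_with_dup, intersected, intersected]).drop_duplicates(keep=False)

print(anti_join_df)  # keeps both copies of the repeated row
print(concat_df)     # loses both copies of the repeated row
```

Which behavior is preferable depends on whether repeated rows inside df1 are meaningful; if they are not, deduplicating df1 first makes both approaches agree.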