When doing data analysis, building a complete panel dataset often requires combining multiple files. When those files overlap, naive concatenation leaves duplicate rows behind. Starting from pandas' duplicate-handling functions, this article works out a clean way to deduplicate combined files.
pandas handles duplicate values mainly through two functions:
- duplicated(): flags duplicate rows
- drop_duplicates(): removes duplicate rows
Their documentation is summarized below.
Help on method duplicated in module pandas.core.frame:
Explanation of duplicated
duplicated(
subset: 'Hashable | Sequence[Hashable] | None' = None,
keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first'
) -> 'Series' method of pandas.core.frame.DataFrame instance
Return boolean Series denoting duplicate rows. Considering certain columns is optional.
Parameters
- subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep : {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to mark.
- - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
- - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
- - False : Mark all duplicates as ``True``.
Returns
Series
Boolean Series indicating the duplicated rows.
See Also
Index.duplicated : Equivalent method on index.
Series.duplicated : Equivalent method on Series.
Series.drop_duplicates : Remove duplicate values from Series.
DataFrame.drop_duplicates : Remove duplicate values from DataFrame.
Examples
Consider a dataset containing ramen ratings:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, within each set of duplicated rows, the first occurrence is marked False and all others are marked True.
df.duplicated()
0 False  # duplicate, first occurrence: marked False
1 True   # duplicate, second occurrence: marked True
2 False
3 False
4 False
dtype: bool
With keep set to 'last', within each set of duplicated rows, the last occurrence is marked False and all others are marked True.
df.duplicated(keep='last')
0 True   # duplicate, first occurrence: marked True
1 False  # duplicate, last occurrence: marked False
2 False
3 False
4 False
dtype: bool
With keep set to False, all duplicates are marked True.
df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
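A practical use of keep=False, sketched below on the same ramen DataFrame, is pulling out every row involved in a conflict for inspection, not just the repeats:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep=False marks every member of a duplicate group, so this mask
# selects all conflicting rows (both rows 0 and 1 here).
conflicts = df[df.duplicated(keep=False)]
print(conflicts)
```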
To detect duplicates on specific column(s), use the subset parameter.
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
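Since duplicated() returns a boolean Series, inverting it gives a row mask that keeps only first occurrences, which is the same result drop_duplicates() produces. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# ~df.duplicated() is True for first occurrences only, so this
# keeps rows 0, 2, 3, 4 -- equivalent to df.drop_duplicates().
unique_rows = df[~df.duplicated()]
print(unique_rows)
```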
Help on method drop_duplicates in module pandas.core.frame:
Explanation of drop_duplicates
drop_duplicates(
subset: 'Hashable | Sequence[Hashable] | None' = None,
keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first',
inplace: 'bool' = False,
ignore_index: 'bool' = False
) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes are ignored.
Parameters
- subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep : {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep.
- - ``first`` : Drop duplicates except for the first occurrence.
- - ``last`` : Drop duplicates except for the last occurrence.
- - False : Drop all duplicates.
- inplace : bool, default False. Whether to drop duplicates in place or to return a copy.
- ignore_index : bool, default False. If True, the resulting axis will be labeled 0, 1, ..., n - 1.
Returns
DataFrame or None
DataFrame with duplicates removed or None if ``inplace=True``.
See Also
DataFrame.value_counts: Count unique combinations of columns.
Examples
Using the same df as above: by default, the function removes rows that are duplicated across all columns.
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
'rating': [4, 4, 3.5, 15, 5]
})
df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To deduplicate based on specific column(s), use the subset parameter.
df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
To keep the last occurrence of each duplicate instead of the first, use the keep parameter.
df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
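The remaining two parameters are worth a quick sketch as well. ignore_index relabels the result 0..n-1 instead of keeping the surviving rows' original labels, and inplace modifies the DataFrame directly, returning None:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Without ignore_index the result keeps labels [0, 2, 3, 4];
# with ignore_index=True it is relabeled [0, 1, 2, 3].
deduped = df.drop_duplicates(ignore_index=True)
print(deduped.index.tolist())

# inplace=True drops duplicates in df itself and returns None.
df.drop_duplicates(inplace=True)
print(len(df))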
Application to panel data
As shown below, df1 is the stale data awaiting update and df2 is the refreshed data. Their values for May 6, 9, and 10 of 2022 differ, so after concatenation the refreshed values should win: df1's rows for those three dates must be removed.
import pandas as pd
df1
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594
4 2022/5/9 2.613143
5 2022/5/10 2.664768
df2
date val
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
After concatenating, set subset to 'date' and keep to 'last'; finally, set 'date' as the index to keep the series in order. (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
df3 = pd.concat([df1, df2])
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
3 2022/5/6 2.598594 #
4 2022/5/9 2.613143 #
5 2022/5/10 2.664768 #
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# The rows marked with # above are the ones removed once drop_duplicates is applied below.
df3 = df3.drop_duplicates(subset='date', keep='last')
df3
date val
0 2022/4/28 2.490852
1 2022/4/29 2.594890
2 2022/5/5 2.637793
0 2022/5/6 2.600081
1 2022/5/9 2.615622
2 2022/5/10 2.667454
3 2022/5/11 2.708073
4 2022/5/12 2.710368
5 2022/5/13 2.734073
# Setting 'date' as the index makes the data tidier.
df3.set_index('date')
val
date
2022/4/28 2.490852
2022/4/29 2.594890
2022/5/5 2.637793
2022/5/6 2.600081
2022/5/9 2.615622
2022/5/10 2.667454
2022/5/11 2.708073
2022/5/12 2.710368
2022/5/13 2.734073
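The whole update workflow can be chained end to end. A minimal sketch with the values from the tables above, using pd.concat in place of the removed DataFrame.append:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2022/4/28', '2022/4/29', '2022/5/5',
             '2022/5/6', '2022/5/9', '2022/5/10'],
    'val': [2.490852, 2.594890, 2.637793, 2.598594, 2.613143, 2.664768],
})
df2 = pd.DataFrame({
    'date': ['2022/5/6', '2022/5/9', '2022/5/10',
             '2022/5/11', '2022/5/12', '2022/5/13'],
    'val': [2.600081, 2.615622, 2.667454, 2.708073, 2.710368, 2.734073],
})

# Concatenate, let the refreshed rows (keep='last') win on
# overlapping dates, then set 'date' as the index.
df3 = (
    pd.concat([df1, df2])
    .drop_duplicates(subset='date', keep='last')
    .set_index('date')
)
print(df3)
```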