22. Data Cleaning in Pandas: Removing Duplicates
Pandas provides the duplicated method to check whether a DataFrame contains duplicate rows, and the drop_duplicates method to remove them.
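import pandas as pd

# Build a 12-row DataFrame in which every row appears four times
col = ["apple", "pearl", "watermelon"] * 4
pri = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": col, "price": pri})
print(df)
# duplicated() marks each row that repeats an earlier row as True
print(df.duplicated())
# drop_duplicates() returns a new DataFrame without the repeated rows
print(df.drop_duplicates())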
Program output:
         fruit  price
0        apple   2.50
1        pearl   3.00
2   watermelon   2.75
3        apple   2.50
4        pearl   3.00
5   watermelon   2.75
6        apple   2.50
7        pearl   3.00
8   watermelon   2.75
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
dtype: bool
        fruit  price
0       apple   2.50
1       pearl   3.00
2  watermelon   2.75
By default drop_duplicates returns a new DataFrame and leaves the original unchanged. To modify the DataFrame itself, pass inplace=True.
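As a minimal sketch of the inplace behaviour, the snippet below reuses the same sample data as the example above. The variable names result and rows_after are only illustrative. Note that with inplace=True the call returns None, so its result should not be assigned back to df.

```python
import pandas as pd

# Same sample data as in the example above
col = ["apple", "pearl", "watermelon"] * 4
pri = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": col, "price": pri})

# With inplace=True the call returns None and df itself is modified
result = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(result)      # None
print(rows_after)  # 3
```

Writing df = df.drop_duplicates() has the same effect on df and is often easier to read in a chain of operations.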
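By default the first occurrence of each duplicated row is kept. To keep the last occurrence instead, use the keep parameter (keep="last").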
import pandas as pd

col = ["apple", "pearl", "watermelon"] * 4
pri = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": col, "price": pri})
print(df)
print(df.duplicated())
# Default keep="first": the first occurrence of each row is kept
print(df.drop_duplicates())
# keep="last": the last occurrence of each row is kept
print(df.drop_duplicates(keep="last"))
Program output:
         fruit  price
0        apple   2.50
1        pearl   3.00
2   watermelon   2.75
3        apple   2.50
4        pearl   3.00
5   watermelon   2.75
6        apple   2.50
7        pearl   3.00
8   watermelon   2.75
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
dtype: bool
        fruit  price
0       apple   2.50
1       pearl   3.00
2  watermelon   2.75
         fruit  price
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
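Besides "first" and "last", the keep parameter also accepts False, which drops every row that has a duplicate and leaves only rows that occur exactly once. The sketch below compares the three settings. The extra "grape" row is made up for this illustration (it is not part of the data above) so that one row survives keep=False.

```python
import pandas as pd

# Sample data from above, plus one hypothetical unique row ("grape")
col = ["apple", "pearl", "watermelon"] * 4 + ["grape"]
pri = [2.50, 3.00, 2.75] * 4 + [4.00]
df = pd.DataFrame({"fruit": col, "price": pri})

# Index labels of the rows that survive under each keep setting
first_idx = list(df.drop_duplicates(keep="first").index)
last_idx = list(df.drop_duplicates(keep="last").index)
unique_idx = list(df.drop_duplicates(keep=False).index)

print(first_idx)   # [0, 1, 2, 12]
print(last_idx)    # [9, 10, 11, 12]
print(unique_idx)  # [12]
```

The surviving rows keep their original index labels in all three cases, which is why the keep="last" output above starts at 9.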