Import the required packages: numpy and pandas
import numpy as np
import pandas as pd
Create a table:
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'Beijing '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
This produces the following table:
Handling duplicate data in Python
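The drop_duplicates function removes duplicate values. Take the city column as an example: the city field contains duplicate values. By default, drop_duplicates() keeps the first occurrence of each value and removes the duplicates that appear later. With the keep='last' argument, it instead removes the earlier occurrences and keeps the last one. The code and a comparison of the results follow. df["city"].drop_duplicates() keeps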
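As a minimal sketch of the comparison described here, the snippet below rebuilds only the city column as a standalone Series. The names city, keep_first, and keep_last are illustrative and do not come from the original. It assumes the city values are exactly those used when the table was created, so 'Beijing ' (with its trailing space) at index 0 and index 5 is the only exact duplicate.

```python
import pandas as pd

# Illustrative copy of the city column; 'Beijing ' (trailing space included)
# appears at index 0 and index 5, so it is the only exact duplicate.
city = pd.Series(
    ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'Beijing '],
    name="city",
)

# Default behaviour (keep='first'): the later duplicate at index 5 is dropped.
keep_first = city.drop_duplicates()

# keep='last': the earlier occurrence at index 0 is dropped instead.
keep_last = city.drop_duplicates(keep='last')

print(keep_first.index.tolist())  # [0, 1, 2, 3, 4]
print(keep_last.index.tolist())   # [1, 2, 3, 4, 5]
```

Note that 'SH' and 'shanghai' are treated as different values, because drop_duplicates compares the strings exactly as stored.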