Python方法sort_values踩坑记录
我的需求是:在使用sort_values对CSV数据中的某一特定列进行排序,然后比较两个文件是否一致
在使用sort_values后,我使用遍历方法比较CSV文件数据。但是无论如何遍历,我都没办法正确比较结果(即文件数据一致,但是就是比对错误)
这个问题和sort_values的特征有关,我在此记录一下
一、原先的处理方法
1.读取DataFrame数据
2.使用sort_values排序
3.遍历DataFrame并比对结果
效果:比对结果出错
举个栗子
import pandas as pd
if __name__ == '__main__':
# 创建DataFrame
cols = ["stu_id", "name", "age", "sex"]
table1 = pd.DataFrame(columns=cols)
#填入数据
table1["stu_id"] = [13, 22, 3, 34, 15]
table1["name"] = ["Tom", "Jim", "Alice", "Bob", "John"]
table1["age"] = [20, 22, 19, 23, 21]
table1["sex"] = ["male", "male", "female", "male", "female"]
#根据stu_id排序
table1.sort_values(by="stu_id", ascending=True, inplace=True)
#同table1
table2 = pd.DataFrame(columns=cols)
table2["stu_id"] = [13, 34, 15, 22, 3]
table2["name"] = ["Tom", "Bob", "John", "Jim", "Alice"]
table2["age"] = [20, 23, 21, 22, 19]
table2["sex"] = ["male", "male", "female", "male", "female"]
table2.sort_values(by="stu_id", ascending=True, inplace=True)
#判断结果
flag = True
for i in range(0, 5):
for head in table1.columns:
if table1[head][i] != table2[head][i]:
flag = False
print("The data in the [{}] column is different".format(head))
print("index: " + str(i) + " with data:" + str(table1[head][i]) + " " + str(table2[head][i]))
if flag:
print("The data in the table is the same")
明明只是顺序不同的数据,根据stu_id排序后本应相同了,可是结果就是错的,如下:
#其实不看也可以
The data in the [stu_id] column is different
index: 1 with data:22 34
The data in the [name] column is different
index: 1 with data:Jim Bob
The data in the [age] column is different
index: 1 with data:22 23
The data in the [stu_id] column is different
index: 2 with data:3 15
The data in the [name] column is different
index: 2 with data:Alice John
The data in the [age] column is different
index: 2 with data:19 21
The data in the [stu_id] column is different
index: 3 with data:34 22
The data in the [name] column is different
index: 3 with data:Bob Jim
The data in the [age] column is different
index: 3 with data:23 22
The data in the [stu_id] column is different
index: 4 with data:15 3
The data in the [name] column is different
index: 4 with data:John Alice
The data in the [age] column is different
index: 4 with data:21 19
二、原因
仔细看了下对不上的行数,都是排序后更换位置的数据。
打印一下table1和table2,发现:
table1:
stu_id name age sex
2 3 Alice 19 female
0 13 Tom 20 male
4 15 John 21 female
1 22 Jim 22 male
3 34 Bob 23 male
table2:
stu_id name age sex
4 3 Alice 19 female
0 13 Tom 20 male
2 15 John 21 female
3 22 Jim 22 male
1 34 Bob 23 male
可以看到DataFrame的默认排序id还在,并且两行排序id并不相同(因为本来顺序就不同)。猜测遍历时使用的id顺序还是原id,也就是说sort_values的排序没有生效。
那么我们只要把df默认的id顺序重新改掉就好了。
三、解决办法
在遍历数据之前添加代码:
table1=table1.reset_index(drop=True)
table2=table2.reset_index(drop=True)
这个意思是丢弃原先的index,但是因为df就是会自动生成index,所以丢弃后它会重新排序,变成:
table1:
stu_id name age sex
0 3 Alice 19 female
1 13 Tom 20 male
2 15 John 21 female
3 22 Jim 22 male
4 34 Bob 23 male
table2:
stu_id name age sex
0 3 Alice 19 female
1 13 Tom 20 male
2 15 John 21 female
3 22 Jim 22 male
4 34 Bob 23 male
然后遍历的时候排序就正常了。
当然,如果不想丢弃原先的index,可以新建一个df存放数据(要注意表头的head可能会增加一个untitled的表头,就是刚刚的这个index,可以用drop删除掉)
总结
遇到这个问题的时候我甚至都有点怀疑人生,觉得:啊?我难道连一个遍历都写不好吗?
在生无可恋地检查df的时候(第一百次觉得自己没错),我阴差阳错发现如果把数据存进CSV,再读出来,问题就解决了,然后才想到出问题的地方。
总之教训就是:在用不熟悉的函数和功能的时候,要先好好了解,不能拿来就用。