[Python Pitfall Notes]

ProgrammingCatTheCat

Last modified 2024-04-19 14:38:50

Categories: Pandas, Python. Tags: python, pandas, bug

First published 2024-03-19 15:57:10

Article link: https://blog.csdn.net/hin23/article/details/136844089

Pitfall notes on the pandas method sort_values

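My requirement: sort CSV data by one specific column with sort_values, then check whether two files contain the same data.
After calling sort_values, I compared the data by looping over the rows. No matter how I looped, the comparison came out wrong (the file data was identical, yet the check kept reporting differences).
The problem comes from how sort_values treats the index, so I am recording it here.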

I. The original approach

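1. Read the data into a DataFrame
2. Sort it with sort_values
3. Loop over the DataFrame and compare the results
Result: the comparison reports differences that should not exist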

An example:

import pandas as pd

if __name__ == '__main__':
    # Create the DataFrame
    cols = ["stu_id", "name", "age", "sex"]
    table1 = pd.DataFrame(columns=cols)
    # Fill in the data
    table1["stu_id"] = [13, 22, 3, 34, 15]
    table1["name"] = ["Tom", "Jim", "Alice", "Bob", "John"]
    table1["age"] = [20, 22, 19, 23, 21]
    table1["sex"] = ["male", "male", "female", "male", "female"]
    # Sort by stu_id
    table1.sort_values(by="stu_id", ascending=True, inplace=True)

    # Same rows as table1, entered in a different order
    table2 = pd.DataFrame(columns=cols)

    table2["stu_id"] = [13, 34, 15, 22, 3]
    table2["name"] = ["Tom", "Bob", "John", "Jim", "Alice"]
    table2["age"] = [20, 23, 21, 22, 19]
    table2["sex"] = ["male", "male", "female", "male", "female"]

    table2.sort_values(by="stu_id", ascending=True, inplace=True)

    # Compare the two tables cell by cell
    flag = True
    for i in range(0, 5):
        for head in table1.columns:
            if table1[head][i] != table2[head][i]:
                flag = False
                print("The data in the [{}] column is different".format(head))
                print("index: " + str(i) + " with data:" + str(table1[head][i]) + " " + str(table2[head][i]))
    if flag:
        print("The data in the table is the same")

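The two tables hold the same rows, just in a different order, so after sorting by stu_id they should be identical. Yet the output says otherwise:

# (You can skip this output)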
The data in the [stu_id] column is different
index: 1 with data:22 34
The data in the [name] column is different
index: 1 with data:Jim Bob
The data in the [age] column is different
index: 1 with data:22 23
The data in the [stu_id] column is different
index: 2 with data:3 15
The data in the [name] column is different
index: 2 with data:Alice John
The data in the [age] column is different
index: 2 with data:19 21
The data in the [stu_id] column is different
index: 3 with data:34 22
The data in the [name] column is different
index: 3 with data:Bob Jim
The data in the [age] column is different
index: 3 with data:23 22
The data in the [stu_id] column is different
index: 4 with data:15 3
The data in the [name] column is different
index: 4 with data:John Alice
The data in the [age] column is different
index: 4 with data:21 19

II. Cause

Looking closely at the mismatched rows, they are all rows that changed position during sorting.
Printing table1 and table2 shows:

table1:
   stu_id   name  age     sex
2       3  Alice   19  female
0      13    Tom   20    male
4      15   John   21  female
1      22    Jim   22    male
3      34    Bob   23    male
table2:
   stu_id   name  age     sex
4       3  Alice   19  female
0      13    Tom   20    male
2      15   John   21  female
3      22    Jim   22    male
1      34    Bob   23    male

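You can see that the DataFrame's original index labels are still there, and that they differ between the two tables (because the rows were entered in a different order). My first guess was that the loop still followed the original index order, as if the sort had never taken effect. More precisely: sort_values does reorder the rows, but each row keeps its original index label, and with an integer index table1[head][i] looks up the row whose label is i, not the row at position i. The loop was therefore comparing the rows in their original, unsorted order.

So all we need to do is rebuild the df's default index so that it matches the sorted order.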
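The difference between label-based and position-based lookup can be seen in a small sketch. The names `df`, `sorted_df`, `by_label`, `by_position` and `by_brackets` are made up for this illustration; the data is the stu_id column from the example above.

```python
import pandas as pd

df = pd.DataFrame({"stu_id": [13, 22, 3, 34, 15]})
sorted_df = df.sort_values(by="stu_id")

# The rows are reordered, but every row keeps its original index label.
index_labels = sorted_df.index.tolist()   # [2, 0, 4, 1, 3]

# Label-based lookup: the row whose index label is 0 (stu_id 13).
by_label = sorted_df["stu_id"].loc[0]

# Position-based lookup: the first row in sorted order (stu_id 3).
by_position = sorted_df["stu_id"].iloc[0]

# Plain [] on a Series with an integer index behaves like .loc here.
by_brackets = sorted_df["stu_id"][0]

print(index_labels, by_label, by_position, by_brackets)
```

Writing `.loc` or `.iloc` explicitly makes it obvious which kind of lookup a loop is performing.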

III. Solution

Add the following code before the loop:

table1 = table1.reset_index(drop=True)
table2 = table2.reset_index(drop=True)

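This discards the original index. Because a DataFrame always has an index, a new one numbered from 0 is generated in the current (sorted) row order, so the tables become: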

table1:
   stu_id   name  age     sex
0       3  Alice   19  female
1      13    Tom   20    male
2      15   John   21  female
3      22    Jim   22    male
4      34    Bob   23    male
table2:
   stu_id   name  age     sex
0       3  Alice   19  female
1      13    Tom   20    male
2      15   John   21  female
3      22    Jim   22    male
4      34    Bob   23    male
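As a possible alternative, not what I used originally, the index can be rebuilt during the sort itself. This is a minimal sketch assuming pandas 1.0 or later, where sort_values accepts `ignore_index=True`. It also swaps the hand-written loop for `DataFrame.equals`, which compares values, dtypes and index in one call. The names `sorted1`, `sorted2` and `same` are made up for this illustration.

```python
import pandas as pd

table1 = pd.DataFrame({
    "stu_id": [13, 22, 3, 34, 15],
    "name": ["Tom", "Jim", "Alice", "Bob", "John"],
    "age": [20, 22, 19, 23, 21],
    "sex": ["male", "male", "female", "male", "female"],
})
table2 = pd.DataFrame({
    "stu_id": [13, 34, 15, 22, 3],
    "name": ["Tom", "Bob", "John", "Jim", "Alice"],
    "age": [20, 23, 21, 22, 19],
    "sex": ["male", "male", "female", "male", "female"],
})

# ignore_index=True relabels the sorted rows 0..n-1, so no separate
# reset_index call is needed afterwards.
sorted1 = table1.sort_values(by="stu_id", ignore_index=True)
sorted2 = table2.sort_values(by="stu_id", ignore_index=True)

# equals() is True only if shape, dtypes, index and values all match.
same = sorted1.equals(sorted2)
print("The data in the table is the same" if same else "The tables differ")
```

If the loop is still needed in order to report which cells differ, `sorted1[head].iloc[i]` reads by position and does not depend on the index labels at all.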