DataComPy: a very handy package for comparing two Pandas DataFrames

Lan.W

Last modified: 2022-10-23 14:19:54

First published: 2022-10-23 11:45:29

Article link: https://blog.csdn.net/llanyw/article/details/127472542

Official documentation:

DataComPy (datacompy 0.8.2 documentation): https://capitalone.github.io/datacompy/index.html

Requirements:

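The two DataFrames should have exactly the same columns, both in number and in names. Otherwise you may hit an error like this when looking up a column's result:

KeyError: 'xxxxx_match'

If the column names differ, rename them first and then compare. After that, fetch the comparison results by column name; each result is returned as a DataFrame.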
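
The renaming step can be done in pandas before calling datacompy. A minimal sketch, assuming hypothetical column names (`id`, `amount`, `ID`, `Amt`) and made-up values that are not from the original data:

```python
import pandas as pd

# Hypothetical data: similar content, but different column names.
df1 = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Amt": [10.0, 25.0, 40.0]})

# Map df2's names onto df1's names so both frames share the same columns.
df2 = df2.rename(columns={"ID": "id", "Amt": "amount"})

# Sanity check before handing the frames to datacompy.Compare.
same_columns = list(df1.columns) == list(df2.columns)
print(same_columns)
```

Checking the column lists up front makes a name mismatch visible before the comparison runs, rather than later as a KeyError.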

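import datacompy

# df1 and df2 are existing pandas DataFrames with five columns each.
# Use string column names so that join_columns="1" refers to a real column.
df1.columns = ['1', '2', '3', '4', '5']
df2.columns = ['1', '2', '3', '4', '5']

dd = datacompy.Compare(df1, df2, join_columns="1")  # '1' is the column name to join on
print(dd.report())  # print the full comparison report

print('---- 2----')
diff_per = dd.sample_mismatch('2')  # sample of rows where column 2 differs; returns a DataFrame
print(diff_per)
print('---- 3----')
diff_per1 = dd.sample_mismatch('3')
print(diff_per1)  # sample of rows where column 3 differs; returns a DataFrame

print('---- 4-----')
diff_per2 = dd.sample_mismatch('4')
print(diff_per2)  # sample of rows where column 4 differs; returns a DataFrame
print('---- rows only in df1 -----')
print(dd.df1_unq_rows)
print('---- rows only in df2 -----')
print(dd.df2_unq_rows)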
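
To make it clearer what a per-column mismatch result contains, here is a pandas-only sketch of the same idea. It is not datacompy's implementation; the frames and values are hypothetical, and only the key column "1" and the value column "3" follow the naming used in this post:

```python
import pandas as pd

# Hypothetical frames that share the key column "1" and a value column "3".
df1 = pd.DataFrame({"1": ["a", "b", "c"], "3": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"1": ["a", "b", "d"], "3": [1.0, 27.0, 4.0]})

# Inner merge: keep only keys present in both frames, one value column per side.
both = df1.merge(df2, on="1", suffixes=("_df1", "_df2"))

# Rows whose values in column "3" differ between the two frames.
mismatch = both[both["3_df1"] != both["3_df2"]]
print(mismatch)
```

Note that a plain `!=` treats two missing values as different, so this sketch is only a rough stand-in for the library's comparison.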

Output of dd.report():

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        5    53
1       df2        5    50

Column Summary
--------------

Number of columns in common: 5
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: 1
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 43
Number of rows in df1 but not in df2: 10
Number of rows in df2 but not in df1: 7

Number of rows with some compared columns unequal: 8
Number of rows with all compared columns equal: 35

Column Comparison
-----------------

Number of columns compared with some values unequal: 4
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 13

Columns with Unequal Values or Types
------------------------------------

  Column  df1 dtype  df2 dtype  # Unequal  Max Diff  # Null Diff
0      2     object     object          3       0.0            1
1      3    float64    float64          4      25.0            1
2      4     object     object          3       0.0            1
3      5     object     object          3       0.0            1

Sample Rows with Unequal Values
-------------------------------

1 2 (df1) 2 (df2)

...

Sample Rows Only in df1 (First 10 Columns)
------------------------------------------

1 2 3 4 5
...

Sample Rows Only in df2 (First 10 Columns)
------------------------------------------

1 2 3 4 5

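Parameters of datacompy.Compare():

# Compare parameters:
# df1: the first DataFrame
# df2: the second DataFrame
# join_columns: column name(s) to join on; default None; accepts a list, e.g. ['ID', 'Name']
# on_index: whether to join on the DataFrame index; if True, join_columns is not needed; default False
# abs_tol: absolute tolerance; default 0
# rel_tol: relative tolerance; default 0
# df1_name: name used for the first DataFrame in the report; default "df1"
# df2_name: name used for the second DataFrame in the report; default "df2"
# ignore_spaces: whether to ignore leading and trailing spaces in string columns; default False
# ignore_case: whether to ignore case in string columns; default False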
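
The two tolerance parameters are easier to understand with numbers. To my understanding, datacompy compares numeric columns along the lines of numpy.isclose, so the sketch below uses that function directly; treat the mapping of abs_tol to `atol` and rel_tol to `rtol` as an assumption, and the values as made up:

```python
import numpy as np

# Hypothetical values of one numeric column in each DataFrame.
df1_values = np.array([100.0, 100.0, 100.0])
df2_values = np.array([100.0, 100.5, 103.0])

# abs_tol=0, rel_tol=0: only identical values count as equal.
exact = np.isclose(df1_values, df2_values, rtol=0, atol=0)

# abs_tol=1: differences of up to 1.0 count as equal.
within_abs = np.isclose(df1_values, df2_values, rtol=0, atol=1)

# rel_tol=0.05: differences of up to 5% of the second value count as equal.
within_rel = np.isclose(df1_values, df2_values, rtol=0.05, atol=0)

print(exact, within_abs, within_rel)
```

With the defaults of 0 for both tolerances, even tiny floating-point differences are reported as unequal, which is worth keeping in mind when comparing computed columns.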

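Problem: in the result for rows that are in df2 but not in df1, the row index values do not match those of the original table.

Solution: to get the real row index from the original table for the rows that are in df2 but not in df1, swap the order of df1 and df2 and run the comparison once more:

dd2 = datacompy.Compare(df2, df1, join_columns="1")  # '1' is the column name to join on

Then print:

print('---- rows only in df2 -----')
print(dd2.df1_unq_rows)  # still df1_unq_rows: it follows argument order, so df2 (passed first) is "df1" inside datacompy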
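
If only the original row labels are needed, a pandas-only alternative avoids the second comparison. This is a sketch under assumptions, not part of datacompy: the frames are hypothetical, the key column is named "1" as above, and df2 is given a non-default index so the effect is visible:

```python
import pandas as pd

# Hypothetical frames; df2 uses a custom index so preserved labels stand out.
df1 = pd.DataFrame({"1": ["a", "b", "c"], "2": [1, 2, 3]})
df2 = pd.DataFrame(
    {"1": ["b", "c", "d", "e"], "2": [2, 3, 4, 5]},
    index=[10, 11, 12, 13],
)

# Rows of df2 whose key does not appear in df1.
# Boolean masking keeps df2's own index labels untouched.
only_in_df2 = df2[~df2["1"].isin(df1["1"])]
print(only_in_df2.index.tolist())
```

This only checks key membership in a single column; for multi-column keys or value-level differences the datacompy comparison is still the more convenient tool.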
