用pandas高效合并文本文件
1. 需求
需要合并两个文本文件。
order.txt内容如下:
NOLdYDMtIkAdDSsH,2017-02-09 10:40:46,BTEQRSDBTNYIQPAP,合肥,221,104069.75,6244.1849999999995,0.06
rmOIDtwINpKkjOGz,2010-09-26 23:10:00,ITFMPULUFPJUPEOA,长春,188,34236.68,1369.4672,0.04
KBTiWDDaTGIXvKrQ,2008-12-28 19:41:52,YXZVIZYLQSHZRQBQ,哈尔滨,173,30557.41,5194.7597000000005,0.17
qjEBtJNwMgNwZtIw,2021-11-07 16:31:28,LYAFXBGBHMIHEXEA,宁波,348,144349.06,4330.471799999999,0.03
UERkBEDXXGMaIAdZ,2022-07-28 22:58:38,GWIXTWWIAPKRRPQE,石家庄,211,85932.44,3437.2976000000003,0.04
jrWglrLvkJxUyaji,2013-05-29 07:16:18,ROWIVUGPFFYWKLSJ,深圳,486,155476.1,7773.805,0.05
detail.txt内容如下:
NOLdYDMtIkAdDSsH,XE-06,494.76,40
NOLdYDMtIkAdDSsH,IY-34,298.15,111
NOLdYDMtIkAdDSsH,PT-83,731.21,70
rmOIDtwINpKkjOGz,IC-68,182.11,188
KBTiWDDaTGIXvKrQ,AH-84,263.27,32
KBTiWDDaTGIXvKrQ,DH-07,156.97,141
qjEBtJNwMgNwZtIw,KX-89,746.68,24
qjEBtJNwMgNwZtIw,HC-84,120.56,107
qjEBtJNwMgNwZtIw,CS-90,381.66,61
qjEBtJNwMgNwZtIw,LQ-64,578.51,156
UERkBEDXXGMaIAdZ,UE-39,341.73,74
UERkBEDXXGMaIAdZ,ZT-25,442.66,137
jrWglrLvkJxUyaji,TW-65,145.89,93
jrWglrLvkJxUyaji,YA-36,122.4,191
jrWglrLvkJxUyaji,FL-73,315.56,25
jrWglrLvkJxUyaji,BQ-49,625.09,177
需要在detail.txt文件中,按第一列的ID,增加main.txt的日期列。
原来都是用openfile ,readline的方法,做循环比较合并,比较复杂。
2.pandas的merge方法
import pandas as pd
# 读取主表文件 ,定义列名
column_names = ['ORDER_ID','ORDER_DATE','CONTRACT_ID','CITY','ORDER_NUM','ORDER_PRICE','ORDER_TAX','ORDER_TAXRATE']
# 设置列名,读取10行数据演示
main_df = pd.read_csv('order.txt', header=None, names=column_names, nrows=10)
# 读取附表文件 ,定义列名
column_names = ['ORDER_ID','PRODUCT_ID','PRODUCT_PRICE','PRODUCT_NUM']
# 设置列名,读取10行数据演示
detail_df = pd.read_csv('detail.txt', header=None, names=column_names, nrows=10)
# 设置索引列
main_df.set_index('ORDER_ID')
detail_df.set_index('ORDER_ID')
# 用merge合并 ,注意 how = right
merged_df = pd.merge(main_df[['ORDER_ID','ORDER_DATE']], detail_df, left_on='ORDER_ID', right_on='ORDER_ID', how='right')
# 保存到txt文件中 ,不用表头和索引
merged_df.to_csv('merged_table.txt', index=False, header=None)
合并后的文件merged_table.txt:
NOLdYDMtIkAdDSsH,2017-02-09 10:40:46,XE-06,494.76,40
NOLdYDMtIkAdDSsH,2017-02-09 10:40:46,IY-34,298.15,111
NOLdYDMtIkAdDSsH,2017-02-09 10:40:46,PT-83,731.21,70
rmOIDtwINpKkjOGz,2010-09-26 23:10:00,IC-68,182.11,188
KBTiWDDaTGIXvKrQ,2008-12-28 19:41:52,AH-84,263.27,32
KBTiWDDaTGIXvKrQ,2008-12-28 19:41:52,DH-07,156.97,141
qjEBtJNwMgNwZtIw,2021-11-07 16:31:28,KX-89,746.68,24
qjEBtJNwMgNwZtIw,2021-11-07 16:31:28,HC-84,120.56,107
qjEBtJNwMgNwZtIw,2021-11-07 16:31:28,CS-90,381.66,61
qjEBtJNwMgNwZtIw,2021-11-07 16:31:28,LQ-64,578.51,156
简洁,流畅!