1 相同索引连接数据集 : concat\append
1.1 添加行
concat 函数
import pandas as pd
#添加几个数据集
df1 = pd.read_csv('D://postgraduatestudies/py/mypythonstudy/gapdata1.csv',encoding="utf-8",sep=',')
df2 = pd.read_csv('D://postgraduatestudies/py/mypythonstudy/gapdata1-1.csv',encoding="utf-8",sep=',')
df3 = pd.read_csv('D://postgraduatestudies/py/mypythonstudy/gapdata1-1-1.csv',encoding="utf-8",sep=',')
print(df1)
print(df2)
print(df3)
print('---------------------')
#concat函数把DataFrame 连接起来
all_df = pd.concat([df1,df2,df3])
print(all_df)
'''
a b c d
0 bbc cgf bf 248
1 fsdf fgd dv 654
2 tyd RTRT yiu 46
3 GEG hb km 98
0 v x nh m
1 fg x j m
2 g d v g
0 fg m v f
1 v v v m
'''
连接后的数据,行名只是原始行索引的简单堆叠。
使用 iloc 与 loc 查看某行,iloc只会显示第一个重复行名,对比:
print(all_df.iloc[2, ])
print(all_df.loc[2, ])
'''
a tyd
b RTRT
c yiu
d 46
Name: 2, dtype: object
---------------------
a b c d
2 tyd RTRT yiu 46
2 g d v g
'''
将 Series 追加到 DataFrame 结果会出错,因为类型全不匹配,所以会创建不对齐的 新列
#新建一行Series数据
new_df_series = pd.Series(['N','E','W','A'])
print(new_df_series)
print(pd.concat([df1,new_df_series]))
'''
0
0 N
1 E
2 W
3 A
a b c d 0
0 bbc cgf bf 248.0 NaN
1 fsdf fgd dv 654.0 NaN
2 tyd RTRT yiu 46.0 NaN
3 GEG hb km 98.0 NaN
0 NaN NaN NaN NaN N
1 NaN NaN NaN NaN E
2 NaN NaN NaN NaN W
3 NaN NaN NaN NaN A
'''
所以要将新列转化为 DataFrame 并且使用 columns 参数指定数据要绑定的列名
new_df_series = pd.DataFrame([['N','E','W','A']],
columns = ['a','b','c','d']
)
print(new_df_series)
print(pd.concat([df1,new_df_series]))
'''
a b c d
0 N E W A
a b c d
0 bbc cgf bf 248
1 fsdf fgd dv 654
2 tyd RTRT yiu 46
3 GEG hb km 98
0 N E W A
'''
append 函数同样可以追加一个对象
print(df1.append(df2))
print(df1.append(new_df_series))
python 字典可以添加数据行
ignore_index = True 索引,不重复先前的索引值-简单连续追加
new_df = { 'a':'N',
'b':'E',
'c':'W',
'd':'A',}
print(df1.append(new_df,ignore_index=True))
'''
a b c d
0 bbc cgf bf 248
1 fsdf fgd dv 654
2 tyd RTRT yiu 46
3 GEG hb km 98
4 N E W A
'''
1.2 添加列
同添加行相似,concat 函数中 属性axis默认为 0,添加将其设为 1 即为添加列
all_df = pd.concat([df1,df2,df3],axis=1)
print(all_df)
'''
a b c d a b c d a b c d
0 bbc cgf bf 248 v x nh m fg m v f
1 fsdf fgd dv 654 fg x j m v v v m
2 tyd RTRT yiu 46 g d v g NaN NaN NaN NaN
3 GEG hb km 98 NaN NaN NaN NaN NaN NaN NaN NaN
'''
获取列子集,同样会显示所有列名相同的列
print(all_df['a'])
添加一列,不需要转化为 DataFrame,只需提供一个列名与值就可以
df1['newlist']=['N','E','W','A']
print(df1)
df2['newlist']=pd.Series(['N','E','W','A'])
print(df2)
'''
a b c d newlist
0 bbc cgf bf 248 N
1 fsdf fgd dv 654 E
2 tyd RTRT yiu 46 W
3 GEG hb km 98 A
---------------------
a b c d newlist
0 v x nh m N
1 fg x j m E
2 g d v g W
'''
ignore_index = True 参数同样重置索引,列名按数字顺序排序
all_df = pd.concat([df1,df2,df3],axis=1,ignore_index=True)
print(all_df)
'''
0 1 2 3 4 5 6 7 8 9 10 11
0 bbc cgf bf 248 v x nh m fg m v f
1 fsdf fgd dv 654 fg x j m v v v m
2 tyd RTRT yiu 46 g d v g NaN NaN NaN NaN
3 GEG hb km 98 NaN NaN NaN NaN NaN NaN NaN NaN
'''
2 不同索引下连接操作
2.1 连接不同列名的行
先修改把 df1、df2、df3 列名更改为不同
后使用 concat 连接,发现不同列名会新创建,相同列会自动对齐,缺失NaN填充
df1.columns = ['a','b','c','d']
df2.columns = ['e','f','g','h']
df3.columns = ['a','c','e','g']
all_df = pd.concat([df1,df2,df3])
print(all_df)
'''
a b c d e f g h
0 bbc cgf bf 248.0 NaN NaN NaN NaN
1 fsdf fgd dv 654.0 NaN NaN NaN NaN
2 tyd RTRT yiu 46.0 NaN NaN NaN NaN
3 GEG hb km 98.0 NaN NaN NaN NaN
0 NaN NaN NaN NaN v x nh m
1 NaN NaN NaN NaN fg x j m
2 NaN NaN NaN NaN g d v g
0 fg NaN m NaN v NaN f NaN
1 v NaN v NaN v NaN m NaN
'''
concat 的 join参数 (默认outer) 将其设置为 join='inner' 会保留几个数据集共有的列
若没有共有的列,会得到一个空的数据集
all_df_1 = pd.concat([df1,df3],join='inner')
print(all_df_1)
'''
a c
0 bbc bf
1 fsdf dv
2 tyd yiu
3 GEG km
0 fg m
1 v v
'''
2.2 连接不同行名的列
同连接不同列名的行相似
先修改把 df1、df2、df3 行名更改为不同
使用 concat 连接,发现不同行名会新创建,相同行会自动对齐,缺失NaN填充
join 参数设置为 join='inner' 保留几个数据集共有的行,若没有共有的行,得到一个空的数据集
df1.index = ['a','b','c','d']
df2.index = ['e','f','g']
df3.index = ['a','e']
all_df = pd.concat([df1,df2,df3],axis=1)
print(all_df)
all_df_1 = pd.concat([df1,df3],axis=1,join='inner')
print(all_df_1)
'''
a b c d a b c d a b c d
a bbc cgf bf 248.0 NaN NaN NaN NaN fg m v f
b fsdf fgd dv 654.0 NaN NaN NaN NaN NaN NaN NaN NaN
c tyd RTRT yiu 46.0 NaN NaN NaN NaN NaN NaN NaN NaN
d GEG hb km 98.0 NaN NaN NaN NaN NaN NaN NaN NaN
e NaN NaN NaN NaN v x nh m v v v m
f NaN NaN NaN NaN fg x j m NaN NaN NaN NaN
g NaN NaN NaN NaN g d v g NaN NaN NaN NaN
---------------------
a b c d a b c d
a bbc cgf bf 248 fg m v f
'''
3 合并多个数据集 : merge
data_new = 数据集A.merge ( 被调用的数据集B , left_on='A中与B列名不同的名' , right_on='B匹配的列名' )
3.1 一对一合并
连接的列不含任何重复值,先分离出来
data_df1 = df1.loc[[0,2], ]
print(data_df1)
print(df2)
data_new = df2.merge(data_df1,left_on='a',right_on='name')
print(data_new)
'''
name age country score year
0 wang 18 zaozhuang 28.8 1999
2 zhao 23 linyi 24.2 1997
a b c d e
0 wang h 45 28.8 1999
1 zhao h 5 24.2 1997
2 zh d 21 45.0 1997
3 li d 33 234.0 1998
4 zhang n 64 34.4 2000
---------------------
a b c d e name age country score year
0 wang h 45 28.8 1999 wang 18 zaozhuang 28.8 1999
1 zhao h 5 24.2 1997 zhao 23 linyi 24.2 1997
'''
3.2 多对一合并
不需要分离出来使用子集,直接用就是多对一
print(df1)
print(df2)
data_new = df2.merge(df1,left_on='a',right_on='name')
print(data_new)
'''
name age country score year
0 wang 18 zaozhuang 28.80 1999
1 sun 18 yantai 30.23 2000
2 zhao 23 linyi 24.20 1997
3 li 22 weihai 35.90 1998
4 zhang 21 jining 34.40 2000
5 ai 12 taian 32.20 1996
a b c d e
0 wang h 45 28.8 1999
1 zhao h 5 24.2 1997
2 zh d 21 45.0 1997
3 li d 33 234.0 1998
4 zhang n 64 34.4 2000
---------------------
a b c d e name age country score year
0 wang h 45 28.8 1999 wang 18 zaozhuang 28.8 1999
1 zhao h 5 24.2 1997 zhao 23 linyi 24.2 1997
2 li d 33 234.0 1998 li 22 weihai 35.9 1998
3 zhang n 64 34.4 2000 zhang 21 jining 34.4 2000
'''
3.3 多对多合并
把要匹配的列以列表形式传入(根据哪几个列进行合并)
print(df1)
print(df2)
data_new = df2.merge(df1,left_on=['a','d','e'],right_on=['name','score','year'])
print(data_new)
'''
---------------------
a b c d e name age country score year
0 wang h 45 28.8 1999 wang 18 zaozhuang 28.8 1999
1 zhao h 5 24.2 1997 zhao 23 linyi 24.2 1997
2 zhang n 64 34.4 2000 zhang 21 jining 34.4 2000
'''
如果两个数据集要匹配的列名相同该列会合并,如果列名相同但是内容不匹配,会自动向列名添加后缀(_x来自左数据集 , _y 来自右数据集)