20-Pandas concat连接操作

Pandas concat连接操作(堆叠合并数据)

Pandas 通过 concat() 函数能够轻松地将 Series 与 DataFrame 对象组合在一起,函数的语法格式如下:

 pd.concat(objs,axis=0,join='outer',join_axes=None,ignore_index=False)

参数说明如下所示:

参数名称说明
objs一个序列或者是Series、DataFrame对象。
axis表示在哪个轴方向上(行或者列)进行连接操作,默认 axis=0 表示行方向。
join指定连接方式,取值为{“inner”,“outer”},默认为 outer 表示取并集,inner代表取交集。
ignore_index布尔值参数,默认为 False,如果为 True,表示不在连接的轴上使用索引。
join_axes表示索引对象的列表。

为了便于说明,我们这里还是使用之前的数据:empdata.csv; 并将其分割成多个DataFrame:

import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:9,:]
df2 = df_emp_info.iloc[9:,:]
print("前9行数据:\n",df1)
# 重设行号,且不保留原索引
print("后5行数据:\n",df2.reset_index(drop=True))

运行结果:

前9行数据:
    EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
0   7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
1   7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
2   7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
3   7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
4   7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
5   7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
6   7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
7   7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
8   7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
后5行数据:
    EMPNO   ENAME       JOB     MGR    HIREDATE   SAL  COMM  DEPTNO
0   7844  TURNER  SALESMAN  7698.0  1981-09-08  1500   0.0      30
1   7876   ADAMS     CLERK  7788.0  1987-05-23  1100   NaN      20
2   7900   JAMES     CLERK  7698.0  1981-12-03   950   NaN      30
3   7902    FORD   ANALYST  7566.0  1981-12-03  3000   NaN      20
4   7934  MILLER     CLERK  7782.0  1982-01-23  1300   NaN      10

concat()

concat() 函数用于沿某个特定的轴执行连接操作。下面让我们创建不同的对象,并对其进行连接。

1)列名完全一致的纵向堆叠(合并)
import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:9,:]
df2 = df_emp_info.iloc[9:,:].reset_index(drop=True)

print("列名完全一致时,纵向堆叠:\n",pd.concat([df1,df2]))

输出结果:

列名完全一致时,纵向堆叠:
    EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
0   7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
1   7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
2   7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
3   7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
4   7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
5   7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
6   7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
7   7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
8   7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
0   7844  TURNER   SALESMAN  7698.0  1981-09-08  1500     0.0      30
1   7876   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN      20
2   7900   JAMES      CLERK  7698.0  1981-12-03   950     NaN      30
3   7902    FORD    ANALYST  7566.0  1981-12-03  3000     NaN      20
4   7934  MILLER      CLERK  7782.0  1982-01-23  1300     NaN      10

如果想把指定的键与 DataFrame 对象连接,您可以使用 keys 参数来实现。如下所示:

import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:9,:]
df2 = df_emp_info.iloc[9:,:].reset_index(drop=True)

print("列名完全一致时,纵向堆叠:\n",pd.concat([df1,df2],keys=range(1,9)))

输出结果:

列名完全一致时,纵向堆叠:
      EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
1 0   7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
  1   7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
  2   7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
  3   7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
  4   7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
  5   7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
  6   7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
  7   7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
  8   7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
2 0   7844  TURNER   SALESMAN  7698.0  1981-09-08  1500     0.0      30
  1   7876   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN      20
  2   7900   JAMES      CLERK  7698.0  1981-12-03   950     NaN      30
  3   7902    FORD    ANALYST  7566.0  1981-12-03  3000     NaN      20
  4   7934  MILLER      CLERK  7782.0  1982-01-23  1300     NaN      10

上述示中,可以看出行索引 index 存在重复使用的现象,如果想让输出的行索引遵循依次递增的规则,那么需要将 ignore_index 设置为 True。

import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:9,:]
df2 = df_emp_info.iloc[9:,:].reset_index(drop=True)

print("列名完全一致时,纵向堆叠:\n",pd.concat([df1,df2],keys=range(1,9),ignore_index=True))

输出结果:

列名完全一致时,纵向堆叠:
     EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
0    7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
1    7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
2    7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
3    7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
4    7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
5    7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
6    7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
7    7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
8    7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
9    7844  TURNER   SALESMAN  7698.0  1981-09-08  1500     0.0      30
10   7876   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN      20
11   7900   JAMES      CLERK  7698.0  1981-12-03   950     NaN      30
12   7902    FORD    ANALYST  7566.0  1981-12-03  3000     NaN      20
13   7934  MILLER      CLERK  7782.0  1982-01-23  1300     NaN      10

注意:此时的索引顺序被改变了,而且键 keys 指定的键也被覆盖了。

2)列名不完全一致的纵向堆叠(合并)
import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:9,1:]
df2 = df_emp_info.iloc[9:,:6].reset_index(drop=True)

print("列名不完全一致时,纵向堆叠:\n",pd.concat([df1,df2],keys=range(1,9),ignore_index=True))

运行结果:(发现缺失的列值为NaN)

列名完全一致时,纵向堆叠:
      ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO   EMPNO
0    SMITH      CLERK  7902.0  1980-12-17   800     NaN    20.0     NaN
1    ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0    30.0     NaN
2     WARD   SALESMAN  7698.0  1981-02-22  1250   500.0    30.0     NaN
3    JONES    MANAGER  7839.0  1981-04-02  2975     NaN    20.0     NaN
4   MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0    30.0     NaN
5    BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN    30.0     NaN
6    CLARK    MANAGER  7839.0  1981-06-09  2450     NaN    10.0     NaN
7    SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN    20.0     NaN
8     KING  PRESIDENT     NaN  1981-11-17  5000     NaN    10.0     NaN
9   TURNER   SALESMAN  7698.0  1981-09-08  1500     NaN     NaN  7844.0
10   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN     NaN  7876.0
11   JAMES      CLERK  7698.0  1981-12-03   950     NaN     NaN  7900.0
12    FORD    ANALYST  7566.0  1981-12-03  3000     NaN     NaN  7902.0
13  MILLER      CLERK  7782.0  1982-01-23  1300     NaN     NaN  7934.0

3)横向堆叠(合并)

为了演示,我们这里还是使用之前的数据:empdata.csv; 并将其左右分割成多个DataFrame:

import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:,:5]
df2 = df_emp_info.iloc[:,4:]
print("前五列数据:\n",df1)
print("后四列数据:\n",df2)

运行结果:

前五列数据:
     EMPNO   ENAME        JOB     MGR    HIREDATE
0    7369   SMITH      CLERK  7902.0  1980-12-17
1    7499   ALLEN   SALESMAN  7698.0  1981-02-20
2    7521    WARD   SALESMAN  7698.0  1981-02-22
3    7566   JONES    MANAGER  7839.0  1981-04-02
4    7654  MARTIN   SALESMAN  7698.0  1981-09-28
5    7698   BLAKE    MANAGER  7839.0  1981-05-01
6    7782   CLARK    MANAGER  7839.0  1981-06-09
7    7788   SCOTT    ANALYST  7566.0  1987-04-19
8    7839    KING  PRESIDENT     NaN  1981-11-17
9    7844  TURNER   SALESMAN  7698.0  1981-09-08
10   7876   ADAMS      CLERK  7788.0  1987-05-23
11   7900   JAMES      CLERK  7698.0  1981-12-03
12   7902    FORD    ANALYST  7566.0  1981-12-03
13   7934  MILLER      CLERK  7782.0  1982-01-23
后三列数据:
       HIREDATE   SAL    COMM  DEPTNO
0   1980-12-17   800     NaN      20
1   1981-02-20  1600   300.0      30
2   1981-02-22  1250   500.0      30
3   1981-04-02  2975     NaN      20
4   1981-09-28  1250  1400.0      30
5   1981-05-01  2850     NaN      30
6   1981-06-09  2450     NaN      10
7   1987-04-19  3000     NaN      20
8   1981-11-17  5000     NaN      10
9   1981-09-08  1500     0.0      30
10  1987-05-23  1100     NaN      20
11  1981-12-03   950     NaN      30
12  1981-12-03  3000     NaN      20
13  1982-01-23  1300     NaN      10

如果您想要沿着 axis=1 添加两个对象,那么将会追加新的列。

import pandas as pd

df_emp_info  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')

df1 = df_emp_info.iloc[:,:5]
df2 = df_emp_info.iloc[:,4:]
pd.set_option("display.max_columns",60)

print("列名不完全一致时,纵向堆叠:\n",pd.concat([df1,df2],axis=1))

输出结果:(注意结果太长,这里给换行了)

列名不完全一致时,纵向堆叠:
     EMPNO   ENAME        JOB     MGR    HIREDATE    HIREDATE   SAL    COMM  \
0    7369   SMITH      CLERK  7902.0  1980-12-17  1980-12-17   800     NaN   
1    7499   ALLEN   SALESMAN  7698.0  1981-02-20  1981-02-20  1600   300.0   
2    7521    WARD   SALESMAN  7698.0  1981-02-22  1981-02-22  1250   500.0   
3    7566   JONES    MANAGER  7839.0  1981-04-02  1981-04-02  2975     NaN   
4    7654  MARTIN   SALESMAN  7698.0  1981-09-28  1981-09-28  1250  1400.0   
5    7698   BLAKE    MANAGER  7839.0  1981-05-01  1981-05-01  2850     NaN   
6    7782   CLARK    MANAGER  7839.0  1981-06-09  1981-06-09  2450     NaN   
7    7788   SCOTT    ANALYST  7566.0  1987-04-19  1987-04-19  3000     NaN   
8    7839    KING  PRESIDENT     NaN  1981-11-17  1981-11-17  5000     NaN   
9    7844  TURNER   SALESMAN  7698.0  1981-09-08  1981-09-08  1500     0.0   
10   7876   ADAMS      CLERK  7788.0  1987-05-23  1987-05-23  1100     NaN   
11   7900   JAMES      CLERK  7698.0  1981-12-03  1981-12-03   950     NaN   
12   7902    FORD    ANALYST  7566.0  1981-12-03  1981-12-03  3000     NaN   
13   7934  MILLER      CLERK  7782.0  1982-01-23  1982-01-23  1300     NaN   

    DEPTNO  
0       20  
1       30  
2       30  
3       20  
4       30  
5       30  
6       10  
7       20  
8       10  
9       30  
10      20  
11      30  
12      20  
13      10 

以上结果不难看出,HIREDATE列出现了两次

append():纵向堆叠(合并)

如果要连接 Series 和 DataFrame 对象,有一个最方便、快捷的方法,那就是 append() 方法。该方法沿着 axis=0 (行方向)进行操作。

append()纵向堆叠要求:两张表的列名必须完全一致。新版本无法使用

import pandas as pd
a= pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

b= pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D1', 'D2', 'D5', 'D6']},
                     index=[4,5,6,7])
#沿着 axis=0,使用 apppend()方法连接a与b
print(a.append(b))

输出结果:

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D1
5  A5  B5  C5  D2
6  A6  B6  C6  D5
7  A7  B7  C7  D6

当然 append() 函数也可接收多个对象,示例如下:

a= pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

b= pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D1', 'D2', 'D5', 'D6']},
                     index=[4,5,6,7])
c= pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B8', 'B9', 'B10', 'B7'],
                    'C': ['C9', 'C8', 'C7', 'C6'],
                    'D': ['D8', 'D5', 'D7', 'D6']},
                     index=[8,9,10,11])
print(a.append(b,c,a))

输出结果:

    A    B   C   D
0   A0   B0  C0  D0
1   A1   B1  C1  D1
2   A2   B2  C2  D2
3   A3   B3  C3  D3
4   A4   B4  C4  D1
5   A5   B5  C5  D2
6   A6   B6  C6  D5
7   A7   B7  C7  D6
8   A4   B8  C9  D8
9   A5   B9  C8  D5
10  A6   B10 C7  D7
11  A7   B7  C6  D6
0   A0   B0  C0  D0
1   A1   B1  C1  D1
2   A2   B2  C2  D2
3   A3   B3  C3  D3
  • 21
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值