Pandas 十三:怎样实现DataFrame的Merge

Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表

merge的语法:

pd.merge(left, right, how=‘inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=(‘x’, ‘y’), copy=True, indicator=False, validate=None)

  • left,right:要merge的dataframe或者有name的Series
  • how:join类型,‘left’, ‘right’, ‘outer’, ‘inner’
  • on:join的key,left和right都需要有这个key
  • left_on:left的df或者series的key
  • right_on:right的df或者seires的key
  • left_index,right_index:使用index而不是普通的column做join
  • suffixes:两个元素的后缀,如果列有重名,自动添加后缀,默认是(‘x’, ‘y’)

文档地址:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

本次讲解提纲:

1.电影数据集的join实例
2.理解merge时一对一、一对多、多对多的数量对齐关系
3.理解left join、right join、inner join、outer join的区别
4.如果出现非Key的字段重名怎么办

1、电影数据集的join实例

电影评分数据集

是推荐系统研究的很好的数据集
位于本代码目录:./datas/movielens-1m

包含三个文件:

1.用户对电影的评分数据 ratings.dat
2.用户本身的信息数据 users.dat
3.电影本身的数据 movies.dat
可以关联三个表,得到一个完整的大表

数据集官方地址:https://grouplens.org/datasets/movielens/

1

import pandas as pd
No output

2

df_ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)
No output

3

df_ratings.head()

3

UserID	MovieID	Rating	Timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

4

df_users = pd.read_csv(
    "./datas/movielens-1m/users.dat", 
    sep="::",
    engine='python', 
    names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)
No output

5

df_users.head()

5

UserID	Gender	Age	Occupation	Zip-code
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

6

df_movies = pd.read_csv(
    "./datas/movielens-1m/movies.dat", 
    sep="::",
    engine='python', 
    names="MovieID::Title::Genres".split("::")
)
No output

7

df_movies.head()

7

MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

8

df_ratings_users = pd.merge(
   df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)
No output

9

df_ratings_users.head()

9

UserID	MovieID	Rating	Timestamp	Gender	Age	Occupation	Zip-code
0	1	1193	5	978300760	F	1	10	48067
1	1	661	3	978302109	F	1	10	48067
2	1	914	3	978301968	F	1	10	48067
3	1	3408	4	978300275	F	1	10	48067
4	1	2355	5	978824291	F	1	10	48067

10

df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)
No output

11

df_ratings_users_movies.head(10)

11

UserID	MovieID	Rating	Timestamp	Gender	Age	Occupation	Zip-code	Title	Genres
0	1	1193	5	978300760	F	1	10	48067	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	1193	5	978298413	M	56	16	70072	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	1193	4	978220179	M	25	12	32793	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	1193	4	978199279	M	25	7	22903	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	1193	5	978158471	M	50	1	95350	One Flew Over the Cuckoo's Nest (1975)	Drama
5	18	1193	4	978156168	F	18	3	95825	One Flew Over the Cuckoo's Nest (1975)	Drama
6	19	1193	5	982730936	M	1	10	48073	One Flew Over the Cuckoo's Nest (1975)	Drama
7	24	1193	5	978136709	F	25	7	10023	One Flew Over the Cuckoo's Nest (1975)	Drama
8	28	1193	3	978125194	F	25	1	14607	One Flew Over the Cuckoo's Nest (1975)	Drama
9	33	1193	5	978557765	M	45	3	55421	One Flew Over the Cuckoo's Nest (1975)	Drama

2、理解merge时数量的对齐关系

以下关系要正确理解:

  • one-to-one:一对一关系,关联的key都是唯一的
    • 比如(学号,姓名) merge (学号,年龄)
    • 结果条数为:1*1
  • one-to-many:一对多关系,左边唯一key,右边不唯一key
    • 比如(学号,姓名) merge (学号,[语文成绩、数学成绩、英语成绩])
    • 结果条数为:1*N
  • many-to-many:多对多关系,左边右边都不是唯一的
    • 比如(学号,[语文成绩、数学成绩、英语成绩]) merge (学号,[篮球、足球、乒乓球])
    • 结果条数为:M*N
2.1 one-to-one 一对一关系的merge

12

left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

12

sno	name
0	11	name_a
1	12	name_b
2	13	name_c
3	14	name_d

13

right = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'age': ['21', '22', '23', '24']
                    })
right

13

sno	age
0	11	21
1	12	22
2	13	23
3	14	24

14

# 一对一关系,结果中有4条
pd.merge(left, right, on='sno')

14

sno	name	age
0	11	name_a	21
1	12	name_b	22
2	13	name_c	23
3	14	name_d	24
2.2 one-to-many 一对多关系的merge

注意:数据会被复制

15

left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left

15

sno	name
0	11	name_a
1	12	name_b
2	13	name_c
3	14	name_d

16

right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

16

sno	grade
0	11	语文88
1	11	数学90
2	11	英语75
3	12	语文66
4	12	数学55
5	13	英语29

17

# 数目以多的一边为准
pd.merge(left, right, on='sno')

17

sno	name	grade
0	11	name_a	语文88
1	11	name_a	数学90
2	11	name_a	英语75
3	12	name_b	语文66
4	12	name_b	数学55
5	13	name_c	英语29
2.3 many-to-many 多对多关系的merge

注意:结果数量会出现乘法

18

left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
                      '爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                    })
left

18

sno	爱好
0	11	篮球
1	11	羽毛球
2	12	乒乓球
3	12	篮球
4	12	足球

19

right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right

19

sno	grade
0	11	语文88
1	11	数学90
2	11	英语75
3	12	语文66
4	12	数学55
5	13	英语29

20

pd.merge(left, right, on='sno')

20

sno	爱好	grade
0	11	篮球	语文88
1	11	篮球	数学90
2	11	篮球	英语75
3	11	羽毛球	语文88
4	11	羽毛球	数学90
5	11	羽毛球	英语75
6	12	乒乓球	语文66
7	12	乒乓球	数学55
8	12	篮球	语文66
9	12	篮球	数学55
10	12	足球	语文66
11	12	足球	数学55

3、理解left join、right join、inner join、outer join的区别

21

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'C': ['C0', 'C1', 'C4', 'C5'],
                      'D': ['D0', 'D1', 'D4', 'D5']})
No output

22

left

22

key	A	B
0	K0	A0	B0
1	K1	A1	B1
2	K2	A2	B2
3	K3	A3	B3

23

right

23

key	C	D
0	K0	C0	D0
1	K1	C1	D1
2	K4	C4	D4
3	K5	C5	D5
3.1 inner join,默认

左边和右边的key都有,才会出现在结果里
24

pd.merge(left, right, how='inner')

24

key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
3.2 left join

左边的都会出现在结果里,右边的如果无法匹配则为Null
25

pd.merge(left, right, how='left')

25

key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
2	K2	A2	B2	NaN	NaN
3	K3	A3	B3	NaN	NaN
3.3 right join

右边的都会出现在结果里,左边的如果无法匹配则为Null
26

pd.merge(left, right, how='right')

26

key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
2	K4	NaN	NaN	C4	D4
3	K5	NaN	NaN	C5	D5
3.4 outer join

左边、右边的都会出现在结果里,如果无法匹配则为Null
27

pd.merge(left, right, how='outer')

27

key	A	B	C	D
0	K0	A0	B0	C0	D0
1	K1	A1	B1	C1	D1
2	K2	A2	B2	NaN	NaN
3	K3	A3	B3	NaN	NaN
4	K4	NaN	NaN	C4	D4
5	K5	NaN	NaN	C5	D5

4、如果出现非Key的字段重名怎么办

28

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'A': ['A10', 'A11', 'A12', 'A13'],
                      'D': ['D0', 'D1', 'D4', 'D5']})
No output

29

left

29

key	A	B
0	K0	A0	B0
1	K1	A1	B1
2	K2	A2	B2
3	K3	A3	B3

30

right

30

key	A	D
0	K0	A10	D0
1	K1	A11	D1
2	K4	A12	D4
3	K5	A13	D5

31

pd.merge(left, right, on='key')

31

key	A_x	B	A_y	D
0	K0	A0	B0	A10	D0
1	K1	A1	B1	A11	D1

32

pd.merge(left, right, on='key', suffixes=('_left', '_right'))

32

key	A_left	B	A_right	D
0	K0	A0	B0	A10	D0
1	K1	A1	B1	A11	D1
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值