13. Pandas怎样实现DataFrame的Merge

Pandas怎样实现DataFrame的Merge

Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表

merge的语法:

pd.merge(left, right, how=‘inner’, on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=(’_x’, ‘_y’), copy=True, indicator=False,
validate=None)

  • left,right:要merge的dataframe或者有name的Series
  • how:join类型,‘left’, ‘right’, ‘outer’, ‘inner’
  • on:join的key,left和right都需要有这个key
  • left_on:left的df或者series的key
  • right_on:right的df或者seires的key
  • left_index,right_index:使用index而不是普通的column做join
  • suffixes:两个元素的后缀,如果列有重名,自动添加后缀,默认是(’_x’, ‘_y’)

文档地址:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

本次讲解提纲:

  1. 电影数据集的join实例
  2. 理解merge时一对一、一对多、多对多的数量对齐关系
  3. 理解left join、right join、inner join、outer join的区别
  4. 如果出现非Key的字段重名怎么办

1、电影数据集的join实例

电影评分数据集

是推荐系统研究的很好的数据集
位于本代码目录:./datas/movielens-1m

包含三个文件:

  1. 用户对电影的评分数据 ratings.dat
  2. 用户本身的信息数据 users.dat
  3. 电影本身的数据 movies.dat

可以关联三个表,得到一个完整的大表

数据集官方地址:https://grouplens.org/datasets/movielens/

import pandas as pd
df_ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)
df_ratings.head()
UserIDMovieIDRatingTimestamp
0111935978300760
116613978302109
219143978301968
3134084978300275
4123555978824291
df_users = pd.read_csv(
    "./datas/movielens-1m/users.dat", 
    sep="::",
    engine='python', 
    names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)
df_users.head()
UserIDGenderAgeOccupationZip-code
01F11048067
12M561670072
23M251555117
34M45702460
45M252055455
df_movies = pd.read_csv(
    "./datas/movielens-1m/movies.dat", 
    sep="::",
    engine='python', 
    names="MovieID::Title::Genres".split("::")
)
df_movies.head()
MovieIDTitleGenres
01Toy Story (1995)Animation|Children's|Comedy
12Jumanji (1995)Adventure|Children's|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
df_ratings_users = pd.merge(
   df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)
df_ratings_users.head()
UserIDMovieIDRatingTimestampGenderAgeOccupationZip-code
0111935978300760F11048067
116613978302109F11048067
219143978301968F11048067
3134084978300275F11048067
4123555978824291F11048067
df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)
df_ratings_users_movies.head(10)
UserIDMovieIDRatingTimestampGenderAgeOccupationZip-codeTitleGenres
0111935978300760F11048067One Flew Over the Cuckoo's Nest (1975)Drama
1211935978298413M561670072One Flew Over the Cuckoo's Nest (1975)Drama
21211934978220179M251232793One Flew Over the Cuckoo's Nest (1975)Drama
31511934978199279M25722903One Flew Over the Cuckoo's Nest (1975)Drama
41711935978158471M50195350One Flew Over the Cuckoo's Nest (1975)Drama
51811934978156168F18395825One Flew Over the Cuckoo's Nest (1975)Drama
61911935982730936M11048073One Flew Over the Cuckoo's Nest (1975)Drama
72411935978136709F25710023One Flew Over the Cuckoo's Nest (1975)Drama
82811933978125194F25114607One Flew Over the Cuckoo's Nest (1975)Drama
93311935978557765M45355421One Flew Over the Cuckoo's Nest (1975)Drama

2、理解merge时数量的对齐关系

以下关系要正确理解:

  • one-to-one:一对一关系,关联的key都是唯一的
    • 比如(学号,姓名) merge (学号,年龄)
    • 结果条数为:1*1
  • one-to-many:一对多关系,左边唯一key,右边不唯一key
    • 比如(学号,姓名) merge (学号,[语文成绩、数学成绩、英语成绩])
    • 结果条数为:1*N
  • many-to-many:多对多关系,左边右边都不是唯一的
    • 比如(学号,[语文成绩、数学成绩、英语成绩]) merge (学号,[篮球、足球、乒乓球])
    • 结果条数为:M*N
2.1 one-to-one 一对一关系的merge
left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left
snoname
011name_a
112name_b
213name_c
314name_d
right = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'age': ['21', '22', '23', '24']
                    })
right
snoage
01121
11222
21323
31424
# 一对一关系,结果中有4条
pd.merge(left, right, on='sno')
snonameage
011name_a21
112name_b22
213name_c23
314name_d24
2.2 one-to-many 一对多关系的merge

注意:数据会被复制

left = pd.DataFrame({'sno': [11, 12, 13, 14],
                      'name': ['name_a', 'name_b', 'name_c', 'name_d']
                    })
left
snoname
011name_a
112name_b
213name_c
314name_d
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right
snograde
011语文88
111数学90
211英语75
312语文66
412数学55
513英语29
# 数目以多的一边为准
pd.merge(left, right, on='sno')
snonamegrade
011name_a语文88
111name_a数学90
211name_a英语75
312name_b语文66
412name_b数学55
513name_c英语29
2.3 many-to-many 多对多关系的merge

注意:结果数量会出现乘法

left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
                      '爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                    })
left
sno爱好
011篮球
111羽毛球
212乒乓球
312篮球
412足球
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                       'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                     })
right
snograde
011语文88
111数学90
211英语75
312语文66
412数学55
513英语29
pd.merge(left, right, on='sno')
sno爱好grade
011篮球语文88
111篮球数学90
211篮球英语75
311羽毛球语文88
411羽毛球数学90
511羽毛球英语75
612乒乓球语文66
712乒乓球数学55
812篮球语文66
912篮球数学55
1012足球语文66
1112足球数学55

3、理解left join、right join、inner join、outer join的区别

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'C': ['C0', 'C1', 'C4', 'C5'],
                      'D': ['D0', 'D1', 'D4', 'D5']})
left
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
right
keyCD
0K0C0D0
1K1C1D1
2K4C4D4
3K5C5D5
3.1 inner join,默认

左边和右边的key都有,才会出现在结果里

pd.merge(left, right, how='inner')
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
3.2 left join

左边的都会出现在结果里,右边的如果无法匹配则为Null

pd.merge(left, right, how='left')
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K2A2B2NaNNaN
3K3A3B3NaNNaN
3.3 right join

右边的都会出现在结果里,左边的如果无法匹配则为Null

pd.merge(left, right, how='right')
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K4NaNNaNC4D4
3K5NaNNaNC5D5
3.4 outer join

左边、右边的都会出现在结果里,如果无法匹配则为Null

pd.merge(left, right, how='outer')
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K2A2B2NaNNaN
3K3A3B3NaNNaN
4K4NaNNaNC4D4
5K5NaNNaNC5D5

4、如果出现非Key的字段重名怎么办

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                      'A': ['A10', 'A11', 'A12', 'A13'],
                      'D': ['D0', 'D1', 'D4', 'D5']})
left
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
right
keyAD
0K0A10D0
1K1A11D1
2K4A12D4
3K5A13D5
pd.merge(left, right, on='key')
keyA_xBA_yD
0K0A0B0A10D0
1K1A1B1A11D1
pd.merge(left, right, on='key', suffixes=('_left', '_right'))
keyA_leftBA_rightD
0K0A0B0A10D0
1K1A1B1A11D1

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值