合并与连接
该文章内容为《Pandas数据分析实战》的学习笔记
导入本文需要的包和数据文件
import pandas as pd
groups1 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/groups1.csv")
groups1.head()
groups2 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/groups2.csv")
groups2.head()
categories = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/categories.csv")
categories.head()
# 导入cities数据集
pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/cities.csv")
# 我们可以看到第一个数据的zip列有问题,应该是07093,但pandas在导入是自动省略了0,需要我们设置导入格式
cities = pd.read_csv(
"pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/cities.csv",
dtype = {"zip" : "string"}
)
cities.head()
连接数据集
将groups2中的行连接到groups1的末尾
pd.concat(objs = [groups1, groups2])
Pandas保留了连接中两个DataFrame的原始索引标签,这样会造成出现相同的索引编号。我们可以像ignore_index参数传递一个True值,使新生成的DateFrame有标准的数字索引
pd.concat(objs = [groups1, groups2], ignore_index=True)
如果既想要创建一个不重复的索引,又想要保留数据来自哪个DataFrame。我们可以使用keys参数将一个字符串列表传递给它
pd.concat(objs = [groups1, groups2], ignore_index=True, keys = ["G1", "G2"])
存储
groups = pd.concat([groups1, groups2], ignore_index=True)
groups.head()
连接后的DataFrame中的缺失值
如果我们在合并的两个数据集中,其中一个含有另一个没有的列,那么在合并之后将会用NaN来填充
sports_champions_A = pd.DataFrame(
data=[
["New England Patriots", "Houston Astros"],
["Philadelphia Eagles", "Boston Red Sox"]
],
columns=["Football", "Baseball"],
index=[2017, 2018]
)
sports_champions_A.head()
sports_champions_B = pd.DataFrame(
data=[
["New England Patriots", "St. Louis Blues"],
["Kansas City Chiefs", "Tampa Bay Lightning"]
],
columns=["Football", "Hockey"],
index=[2019, 2020]
)
sports_champions_B.head()
pd.concat([sports_champions_A, sports_champions_B])
左连接
左连接是使用一个数据集的键从另一个数据集提取值。当一个数据集作为分析焦点时左连接是最佳选择
groups.head()
categories.head()
下面我们对groups执行左连接,可以使用merge方法,第一个参数传入数据,how参数传入如何连接这里我们选择left,第三个参数选择以什么作为匹配
groups.merge(categories, how = "left", on = "category_id")
内连接
内连接表示同时出现在两个DataFrame中的值
groups.merge(categories, how = "inner", on = "category_id")
外连接
外连接将两个数据集的所有记录组合在一起。Pandas包含所有值,无论它们属于其中一个数据集,还是同时属于两个数据集
groups.merge(categories, how = "outer", on = "category_id")
我们可以将参数indicator设定为True,可以显示一个值属于哪个DataFrame。合并后的DataFrame将包含一个_merge列,值为both、left_only或right_only
roups.merge(categories, how = "outer", on = "category_id", indicator = True)
我们可以在此基础上只提取城市ID存在于cities中的那些行
outer_join = groups.merge(cities, how = "outer", left_on = "city_id", right_on = "id", indicator = True)
in_right_only = outer_join["_merge"] == "right_only"
outer_join[in_right_only].head()
通过索引合并
我们可以通过将left_index和right_index两个参数设置为True来查找匹配的标签合并
cities.head()
cities = cities.set_index("id")
groups.head()
groups.merge(cities,
how = "left",
left_on="city_id",
right_index = True)
代码挑战
本例共用到四个 CSV 文件:
-
week_1_sales.csv
和week_2_sales.csv
结构相同,记录餐厅在第 1 周和第 2 周的订单数据。包含以下两列:Customer ID
:顾客编号,外键,连接到顾客信息表(customers.csv
)Food ID
:菜品编号,外键,连接到菜品信息表(foods.csv
)
-
customers.csv
存放顾客主数据,读取时以ID
为索引列。包含以下字段:First Name
:顾客名字Last Name
:顾客姓氏Gender
:性别Company
:工作单位Occupation
:职业
-
foods.csv
存放菜品信息,读取时以Food ID
为索引列。包含以下字段:Food Item
:菜品名称Price
:单价(美元)
week1 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/week_1_sales.csv")
week2 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/week_2_sales.csv")
customers = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/customers.csv",
index_col = "ID")
foods = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/foods.csv",
index_col = "Food ID")
本章的挑战问题如下:
(1) 将两周的销售数据连接到一个 DataFrame 中。为 week1 DataFrame 分配 “Week 1”的键, 为 week2 DataFrame 分配 “Week 2”的键。
(2) 找到第一周和第二周都在餐厅吃饭的顾客。
(3) 找到第一周和第二周都在餐厅用餐,并每周选择相同食物的顾客。
提示:可以将列的列表传递给 on 参数,在多列上对数据集进行连接。
(4) 确定哪些客户仅在第一周或仅在第二周来餐厅用餐。
(5) week1 DataFrame 中的每一行都标示了购买食品的客户。对于每一行,从客户 DataFrame 中提取具体客户的信息。
解决方案
# 1
pd.concat(objs = [week2, week1], keys = ["Week 1", "Week 2"])
# 2
week1.merge(right = week2, on = "Customer ID", how = "inner")
# 3
week1.merge(right = week2, on = ["Customer ID", 'Food ID'], how = "inner")
# 4
temp = week1.merge(right = week2, on = "Customer ID", how = "outer", indicator=True)
temp[temp["_merge"] != "both"]
# 5
week1.merge(right = customers, left_on = "Customer ID", right_index = True, how = "left")
关注公众号小辛到处学,发送1,获取文中的数据资源