Pandas学习笔记(八)

合并与连接

该文章内容为《Pandas数据分析实战》的学习笔记

导入本文需要的包和数据文件

import pandas as pd
groups1 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/groups1.csv")
groups1.head()

groups2 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/groups2.csv")
groups2.head()

categories = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/categories.csv")
categories.head()

# 导入cities数据集
pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/cities.csv")
# 我们可以看到第一个数据的zip列有问题,应该是07093,但pandas在导入是自动省略了0,需要我们设置导入格式
cities = pd.read_csv(
    "pandas-in-action-master/chapter_10_merging_joining_and_concatenating/meetup/cities.csv",
    dtype = {"zip" : "string"}
)
cities.head()

连接数据集

将groups2中的行连接到groups1的末尾

pd.concat(objs = [groups1, groups2])

Pandas保留了连接中两个DataFrame的原始索引标签,这样会造成出现相同的索引编号。我们可以像ignore_index参数传递一个True值,使新生成的DateFrame有标准的数字索引

pd.concat(objs = [groups1, groups2], ignore_index=True)

如果既想要创建一个不重复的索引,又想要保留数据来自哪个DataFrame。我们可以使用keys参数将一个字符串列表传递给它

pd.concat(objs = [groups1, groups2], ignore_index=True, keys = ["G1", "G2"])

存储

groups = pd.concat([groups1, groups2], ignore_index=True)
groups.head()

连接后的DataFrame中的缺失值

如果我们在合并的两个数据集中,其中一个含有另一个没有的列,那么在合并之后将会用NaN来填充

sports_champions_A = pd.DataFrame(
    data=[
        ["New England Patriots", "Houston Astros"],
        ["Philadelphia Eagles",  "Boston Red Sox"]
    ],
    columns=["Football", "Baseball"],
    index=[2017, 2018]
)
sports_champions_A.head()

sports_champions_B = pd.DataFrame(
    data=[
        ["New England Patriots", "St. Louis Blues"],
        ["Kansas City Chiefs",   "Tampa Bay Lightning"]
    ],
    columns=["Football", "Hockey"],
    index=[2019, 2020]
)
sports_champions_B.head()

pd.concat([sports_champions_A, sports_champions_B])

左连接

左连接是使用一个数据集的键从另一个数据集提取值。当一个数据集作为分析焦点时左连接是最佳选择

groups.head()
categories.head()

下面我们对groups执行左连接,可以使用merge方法,第一个参数传入数据,how参数传入如何连接这里我们选择left,第三个参数选择以什么作为匹配

groups.merge(categories, how = "left", on = "category_id")

内连接

内连接表示同时出现在两个DataFrame中的值

groups.merge(categories, how = "inner", on = "category_id")

外连接

外连接将两个数据集的所有记录组合在一起。Pandas包含所有值,无论它们属于其中一个数据集,还是同时属于两个数据集

groups.merge(categories, how = "outer", on = "category_id")

我们可以将参数indicator设定为True,可以显示一个值属于哪个DataFrame。合并后的DataFrame将包含一个_merge列,值为both、left_only或right_only

roups.merge(categories, how = "outer", on = "category_id", indicator = True)

我们可以在此基础上只提取城市ID存在于cities中的那些行

outer_join = groups.merge(cities, how = "outer", left_on = "city_id", right_on = "id", indicator = True)
in_right_only = outer_join["_merge"] == "right_only"
outer_join[in_right_only].head()

通过索引合并

我们可以通过将left_index和right_index两个参数设置为True来查找匹配的标签合并

cities.head()
cities = cities.set_index("id")
groups.head()
groups.merge(cities,
             how = "left",
             left_on="city_id",
             right_index = True)

代码挑战

本例共用到四个 CSV 文件:

  • week_1_sales.csvweek_2_sales.csv
    结构相同,记录餐厅在第 1 周和第 2 周的订单数据。包含以下两列:

    • Customer ID:顾客编号,外键,连接到顾客信息表(customers.csv
    • Food ID:菜品编号,外键,连接到菜品信息表(foods.csv
  • customers.csv
    存放顾客主数据,读取时以 ID 为索引列。包含以下字段:

    • First Name:顾客名字
    • Last Name:顾客姓氏
    • Gender:性别
    • Company:工作单位
    • Occupation:职业
  • foods.csv
    存放菜品信息,读取时以 Food ID 为索引列。包含以下字段:

    • Food Item:菜品名称
    • Price:单价(美元)
week1 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/week_1_sales.csv")
week2 = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/week_2_sales.csv")
customers = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/customers.csv",
                        index_col = "ID")
foods = pd.read_csv("pandas-in-action-master/chapter_10_merging_joining_and_concatenating/restaurant/foods.csv",
                    index_col = "Food ID")

本章的挑战问题如下:

(1) 将两周的销售数据连接到一个 DataFrame 中。为 week1 DataFrame 分配 “Week 1”的键, 为 week2 DataFrame 分配 “Week 2”的键。

(2) 找到第一周和第二周都在餐厅吃饭的顾客。

(3) 找到第一周和第二周都在餐厅用餐,并每周选择相同食物的顾客。
提示:可以将列的列表传递给 on 参数,在多列上对数据集进行连接。

(4) 确定哪些客户仅在第一周或仅在第二周来餐厅用餐。

(5) week1 DataFrame 中的每一行都标示了购买食品的客户。对于每一行,从客户 DataFrame 中提取具体客户的信息。

解决方案
# 1
pd.concat(objs = [week2, week1], keys = ["Week 1", "Week 2"])

# 2
week1.merge(right = week2, on = "Customer ID", how = "inner")

# 3
week1.merge(right = week2, on = ["Customer ID", 'Food ID'], how = "inner")

# 4
temp = week1.merge(right = week2, on = "Customer ID", how = "outer", indicator=True)
temp[temp["_merge"] != "both"]

# 5
week1.merge(right = customers, left_on = "Customer ID", right_index = True, how = "left")

关注公众号小辛到处学,发送1,获取文中的数据资源

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值