GroupBy 对象
pandas 库的 GroupBy 对象是一个存储容器,用于将 DataFrame 行分组到存储桶中。它提供了一组方法来聚合和分析集合中的每个独立组。它允许我们提取每个组中特定索引位置的行。它还提供了一种迭代行组的便捷方法。
import pandas as pd
从头开始创建 GroupBy 对象
我们首先创建一个 DataFrame 来存储超市中水果和蔬菜的价格:
>>> import pandas as pd
>>> food_data = {
... "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
... "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
... "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
... }
>>> supermarket = pd.DataFrame(data=food_data)
>>> supermarket
Item Type Price
0 Banana Fruit 0.99
1 Cucumber Vegetable 1.25
2 Orange Fruit 0.25
3 Tomato Vegetable 0.33
4 Watermelon Fruit 3.00
>>>
GroupBy 对象根据列中的共享值将 DataFrame 行组织到存储桶中。假设我们对水果的平均价格和蔬菜的平均价格感兴趣。如果我们可以将“水果”行和“蔬菜”行分离到不同的组中,则执行计算会更容易。
让我们首先在 supermarket DataFrame 上调用 groupby
方法。我们需要向它传递一个列,pandas 将使用该列的值来创建组,该方法返回一个 GroupBy
。GroupBy
对象与 DataFrame 是独立且不同的:
>>> groups = supermarket.groupby("Type")
>>> groups
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000013B98496290>
>>>
get_group
方法接受组名称并返回包含相应行的 DataFrame。
>>> groups.get_group("Fruit")
Item Type Price
0 Banana Fruit 0.99
2 Orange Fruit 0.25
4 Watermelon Fruit 3.00
>>> groups.get_group("Vegetable")
Item Type Price
1 Cucumber Vegetable 1.25
3 Tomato Vegetable 0.33
>>>
GroupBy 对象擅长聚合操作,比如很容易计算每组数值列的均值。
>>> groups.mean(numeric_only=True)
Price
Type
Fruit 1.413333
Vegetable 0.790000
>>>
从数据集中创建 GroupBy 对象
Fortune1000.csv
文件是 2018 年财富 1000 强公司的集合。每行包括公司名称、收入、利润、员工人数、领域和行业:
>>> fortune = pd.read_csv('../data/fortune1000.csv')
>>> fortune
Company Revenues Profits Employees Sector Industry
0 Walmart 500343.0 9862.0 2300000 Retailing General Merchandisers
1 Exxon Mobil 244363.0 19710.0 71200 Energy Petroleum Refining
2 Berkshire Hathaway 242137.0 44940.0 377000 Financials Insurance: Property and Casualty (Stock)
3 Apple 229234.0 48351.0 123000 Technology Computers, Office Equipment
4 UnitedHealth Group 201159.0 10558.0 260000 Health Care Health Care: Insurance and Managed Care
.. ... ... ... ... ... ...
995 SiteOne Landscape Supply 1862.0 54.6 3664 Wholesalers Wholesalers: Diversified
996 Charles River Laboratories Intl 1858.0 123.4 11800 Health Care Health Care: Pharmacy and Other Services
997 CoreLogic 1851.0 152.2 5900 Business Services Financial Data Services
998 Ensign Group 1849.0 40.5 21301 Health Care Health Care: Medical Facilities
999 HCP 1848.0 414.2 190 Financials Real estate
[1000 rows x 6 columns]
>>>
一个领域可以有很多公司。例如,Apple 和 Amazon.com 都属于 “Technology” 行业。行业是领域内的子类别。例如,“Pipelines” 和 “Petroleum Refining” 行业属于 “Energy” 行业。
加入想要知道某个领域的平均收入,可以这样:
>>> retailing_mask = fortune["Sector"] == "Retailing"
>>> retail_companies = fortune[retailing_mask]
>>> retailing_mask.head()
0 True
1 False
2 False
3 False
4 False
Name: Sector, dtype: bool
>>> retail_companies["Revenues"].head()
0 500343.0
7 177866.0
14 129025.0
22 100904.0
38 71879.0
Name: Revenues, dtype: float64
>>> retail_companies["Revenues"].mean()
21874.714285714286
>>>
虽然这在计算某一个单一的领域时是合理的,但是如果领域很多就不得不使用 for
循环遍历全部的领域。pandas 为我们提供了开箱即用的函数 groupby
。
>>> sectors = fortune.groupby("Sector")
>>> sectors
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000013B9A217A10>
>>>
GroupBy 对象由一个或一组 DataFrame 组成。在底层,pandas 重复了我们对 “Retail” 使用的提取过程,但对 “Sector” 列中的所有 21 个值进行了提取。我们可以通过将 GroupBy 对象传递给 Python 的内置 len
函数来计算 sectors
中的组数。sectors
GroupBy 对象有 21 个 DataFrame。 这个数字等于 Fortune 的 Sector 列中唯一值的数量,我们可以通过调用 nunique
方法来发现:
>>> len(sectors)
21
>>> fortune["Sector"].nunique()
21
>>>
GroupBy 对象上的 size
方法返回一个 Series,其中包含按字母顺序排列的组及其行数列表:
>>> sectors.size()
Sector
Aerospace & Defense 25
Apparel 14
Business Services 53
Chemicals 33
Energy 107
Engineering & Construction 27
Financials 155
Food & Drug Stores 12
Food, Beverages & Tobacco 37
Health Care 71
Hotels, Restaurants & Leisure 26
Household Products 28
Industrials 49
Materials 45
Media 25
Motor Vehicles & Parts 19
Retailing 77
Technology 103
Telecommunications 10
Transportation 40
Wholesalers 44
dtype: int64
>>>
GroupBy 对象的属性和方法
groups
属性存储具有这些组到行关联的字典:
>>> sectors.groups
{'Aerospace & Defense': [26, 50, 58, 98, 117, 118, 207, 224, 275, 380
# .. 省略一些输出 ..
GroupBy 对象的 first
方法提取每个组的第一行,如果数据本身是排序的,那么可以获取最大值或者最小值:
>>> sectors.first()
Company Revenues Profits Employees Industry
Sector
Aerospace & Defense Boeing 93392.0 8197.0 140800 Aerospace and Defense
Apparel Nike 34350.0 4240.0 74400 Apparel
Business Services ManpowerGroup 21034.0 545.4 29000 Temporary Help
Chemicals DowDuPont 62683.0 1460.0 98000 Chemicals
Energy Exxon Mobil 244363.0 19710.0 71200 Petroleum Refining
Engineering & Construction Fluor 19521.0 191.4 56706 Engineering, Construction
Financials Berkshire Hathaway 242137.0 44940.0 377000 Insurance: Property and Casualty (Stock)
Food & Drug Stores Kroger 122662.0 1907.0 449000 Food and Drug Stores
Food, Beverages & Tobacco PepsiCo 63525.0 4857.0 263000 Food Consumer Products
Health Care UnitedHealth Group 201159.0 10558.0 260000 Health Care: Insurance and Managed Care
Hotels, Restaurants & Leisure Marriott International 22894.0 1372.0 177000 Hotels, Casinos, Resorts
Household Products Procter & Gamble 66217.0 15326.0 95000 Household and Personal Products
Industrials General Electric 122274.0 -5786.0 313000 Industrial Machinery
Materials International Paper 23302.0 2144.0 56000 Packaging, Containers
Media Disney 55137.0 8980.0 199000 Entertainment
Motor Vehicles & Parts General Motors 157311.0 -3864.0 180000 Motor Vehicles and Parts
Retailing Walmart 500343.0 9862.0 2300000 General Merchandisers
Technology Apple 229234.0 48351.0 123000 Computers, Office Equipment
Telecommunications AT&T 160546.0 29450.0 254000 Telecommunications
Transportation UPS 65872.0 4910.0 346415 Mail, Package, and Freight Delivery
Wholesalers McKesson 198533.0 5070.0 64500 Wholesalers: Health Care
>>>
同样的 last
方法可以提取每个组的最后一条数据。
nth
方法可以根据每个组内的隐式索引进行提取。每个组的隐式索引与原本被分组的数组不同,在每个组内,隐式索引都是从 0 开始的。可以指定一个负数,表示从后往前。
sectors.nth(0)
可以获得每个组内的第一条数据,这与 first
的作用是一样的。
head
与 tail
方法与 DataFrame 的同名方法作用相同,表示前 n 行和后 n 行,默认为 5。
使用 get_group()
可以提取给定组的全部行:
>>> sectors.get_group("Apparel")
Company Revenues Profits Employees Sector Industry
88 Nike 34350.0 4240.0 74400 Apparel Apparel
241 VF 12400.0 614.9 69000 Apparel Apparel
331 PVH 8915.0 537.8 28050 Apparel Apparel
420 Ralph Lauren 6653.0 -99.3 18250 Apparel Apparel
432 Hanesbrands 6478.0 61.9 67200 Apparel Apparel
526 Under Armour 4977.0 -48.3 11350 Apparel Apparel
529 Levi Strauss 4904.0 281.4 13800 Apparel Apparel
554 Tapestry 4488.3 591.0 12450 Apparel Apparel
587 Skechers U.S.A. 4181.0 179.2 8150 Apparel Apparel
678 Carters 3400.0 302.8 20900 Apparel Apparel
766 G-III Apparel Group 2807.0 62.1 9961 Apparel Apparel
774 Fossil Group 2788.0 -478.2 12300 Apparel Apparel
835 Columbia Sportswear 2466.0 105.1 6188 Apparel Apparel
861 Wolverine World Wide 2350.0 0.3 3700 Apparel Apparel
>>>
聚合操作
我们可以调用 GroupBy 对象上的方法来将聚合操作应用于每个组。
>>> sectors.sum(numeric_only=True).head(3)
Revenues Profits Employees
Sector
Aerospace & Defense 383835.0 26733.5 1010124
Apparel 101157.3 6350.7 355699
Business Services 316090.0 37179.2 1593999
>>>
通过一次 sum 方法调用,pandas 将计算逻辑应用于 GroupBy 对象中的每个嵌套 DataFrame。GroupBy 对象支持很多其他的聚合方法。
>>> sectors.mean(numeric_only=True).head()
Revenues Profits Employees
Sector
Aerospace & Defense 15353.400000 1069.340000 40404.960000
Apparel 7225.521429 453.621429 25407.071429
Business Services 5963.962264 701.494340 30075.452830
Chemicals 7610.636364 620.454545 14364.242424
Energy 14425.300935 805.373585 9170.158879
>>>
我们可以通过在 GroupBy 对象后面的方括号内传递其名称来定位单列。Pandas 返回一个新对象,一个 SeriesGroupBy:
>>> sectors["Revenues"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000013B9A24E350>
>>>
在底层,DataFrameGroupBy 对象存储 SeriesGroupBy 对象的集合。
max
方法返回给定列的最大值。同样的 min
方法返回给定列中的最小值。
>>> sectors["Revenues"].max().head(1)
Sector
Aerospace & Defense 93392.0
Name: Revenues, dtype: float64
>>> sectors["Revenues"].min().head(1)
Sector
Aerospace & Defense 1877.0
Name: Revenues, dtype: float64
>>>
agg
方法对不同的列应用多个聚合操作,并接受字典作为其参数。在每个键值对中,键表示 DataFrame 列,值指定要应用于该列的聚合操作。示例提取每个领域的最低收入、最高利润和平均员工人数:
>>> aggregations = {"Revenues": "min", "Profits": "max", "Employees": "mean"}
>>> sectors.agg(aggregations).head()
Revenues Profits Employees
Sector
Aerospace & Defense 1877.0 8197.0 40404.960000
Apparel 2350.0 4240.0 25407.071429
Business Services 1851.0 6699.0 30075.452830
Chemicals 1925.0 3000.4 14364.242424
Energy 1874.0 19710.0 9170.158879
>>>
Pandas 返回一个 DataFrame,其中聚合字典的键作为列名。用于分组的变量仍作为索引,变量名作为索引名。
将自定义操作应用于所有组
假设我们想要对 GroupBy 对象中的每个嵌套组应用自定义操作。 DataFrame 的 nlargest
方法提取给定列中具有最大值的行:
>>> fortune.nlargest(5, columns="Profits")
Company Revenues Profits Employees Sector Industry
3 Apple 229234.0 48351.0 123000 Technology Computers, Office Equipment
2 Berkshire Hathaway 242137.0 44940.0 377000 Financials Insurance: Property and Casualty (Stock)
15 Verizon 126034.0 30101.0 155400 Telecommunications Telecommunications
8 AT&T 160546.0 29450.0 254000 Telecommunications Telecommunications
19 JPMorgan Chase 113899.0 24441.0 252539 Financials Commercial Banks
>>>
如果我们可以在 sectors
中的每个嵌套 DataFrame 上调用 nlargest
方法,我们就会得到每个领域收入最高的公司。
我们可以在这里使用 GroupBy 对象的 apply
方法。该方法需要一个函数作为参数。它为 GroupBy 对象中的每个组调用该函数一次。然后它收集函数调用的返回值并在新的 DataFrame 中返回它们。
>>> def get_largest_row(df):
... return df.nlargest(1, "Revenues")
...
>>> sectors.apply(get_largest_row).head()
Company Revenues Profits Employees Sector Industry
Sector
Aerospace & Defense 26 Boeing 93392.0 8197.0 140800 Aerospace & Defense Aerospace and Defense
Apparel 88 Nike 34350.0 4240.0 74400 Apparel Apparel
Business Services 142 ManpowerGroup 21034.0 545.4 29000 Business Services Temporary Help
Chemicals 46 DowDuPont 62683.0 1460.0 98000 Chemicals Chemicals
Energy 1 Exxon Mobil 244363.0 19710.0 71200 Energy Petroleum Refining
>>>
如果自定义的函数需要传入参数,那么可以直接在 apply
中为参数赋值:
>>> def get_largest_row(df, columns="Revenues"):
... return df.nlargest(1, columns=columns)
...
>>>
>>> sectors.apply(get_largest_row, columns="Profits").head()
Company Revenues Profits Employees Sector Industry
Sector
Aerospace & Defense 26 Boeing 93392.0 8197.0 140800 Aerospace & Defense Aerospace and Defense
Apparel 88 Nike 34350.0 4240.0 74400 Apparel Apparel
Business Services 160 Visa 18358.0 6699.0 15000 Business Services Financial Data Services
Chemicals 344 Air Products & Chemicals 8442.0 3000.4 15150 Chemicals Chemicals
Energy 1 Exxon Mobil 244363.0 19710.0 71200 Energy Petroleum Refining
>>>
如果需要对组内进行多函数的运算,可以使用 lambda
函数进行操作:
>>> def get_max(df, columns):
... return df[columns].max()
...
>>>
>>> def get_min(df, columns):
... return df[columns].min()
...
>>>
>>> sectors.apply(
... lambda x: pd.Series(
... {
... "Profits": get_max(x, columns="Profits"),
... "Revenues": get_min(x, columns="Revenues"),
... }
... )
... )
Profits Revenues
Sector
Aerospace & Defense 8197.0 1877.0
Apparel 4240.0 2350.0
Business Services 6699.0 1851.0
Chemicals 3000.4 1925.0
Energy 19710.0 1874.0
Engineering & Construction 1038.4 1906.0
Financials 44940.0 1848.0
Food & Drug Stores 4078.0 2064.0
Food, Beverages & Tobacco 10999.0 2071.0
Health Care 21308.0 1849.0
Hotels, Restaurants & Leisure 5192.3 1907.0
Household Products 15326.0 1949.0
Industrials 4858.0 1918.0
Materials 2144.0 1933.0
Media 8980.0 1867.0
Motor Vehicles & Parts 7602.0 2066.0
Retailing 9862.0 1870.0
Technology 48351.0 1871.0
Telecommunications 30101.0 2200.0
Transportation 10712.0 1943.0
Wholesalers 5070.0 1862.0
>>>
>>> def get_value(df, columns):
... return (get_min(df, columns), get_max(df, columns))
...
>>>
>>> sectors.apply(get_value, columns="Profits").head()
Sector
Aerospace & Defense (-74.0, 8197.0)
Apparel (-478.2, 4240.0)
Business Services (-557.1, 6699.0)
Chemicals (-296.2, 3000.4)
Energy (-5723.0, 19710.0)
dtype: object
>>>
当 pandas 不支持你想要应用于每个嵌套组的自定义聚合时,请使用 apply 方法。
按多列分组
我们可以使用多个 DataFrame 列的值创建一个 GroupBy 对象,实现按照多列对 DataFrame 进行分组。
>>> sectors_and_industry = fortune.groupby(by=["Sector", "Industry"])
>>>
GroupBy 对象的 size
方法现在返回一个 MultiIndexSeries
,其中包含每个内部组的行数。这个 GroupBy 对象的长度为 82,这意味着数据集中有 82 种不同的领域和行业组合。
>>> sectors_and_industry.size()
Sector Industry
Aerospace & Defense Aerospace and Defense 25
Apparel Apparel 14
Business Services Advertising, marketing 2
Diversified Outsourcing Services 14
Education 2
..
Transportation Trucking, Truck Leasing 11
Wholesalers Wholesalers: Diversified 24
Wholesalers: Electronics and Office Equipment 8
Wholesalers: Food and Grocery 6
Wholesalers: Health Care 6
Length: 82, dtype: int64
>>>
get_group
方法需要一个值元组来从 GroupBy 集合中提取嵌套的 DataFrame,这里提取领域为 “Business Services” 并且行业为 “Education” 的公司。
>>> sectors_and_industry.get_group(("Business Services", "Education"))
Company Revenues Profits Employees Sector Industry
567 Laureate Education 4378.0 91.5 54500 Business Services Education
810 Graham Holdings 2592.0 302.0 16153 Business Services Education
>>>
对于所有聚合,pandas 都会返回带有计算结果的 MultiIndexDataFrame
。
>>> sectors_and_industry.sum(numeric_only=True).head()
Revenues Profits Employees
Sector Industry
Aerospace & Defense Aerospace and Defense 383835.0 26733.5 1010124
Apparel Apparel 101157.3 6350.7 355699
Business Services Advertising, marketing 23156.0 1667.4 127500
Diversified Outsourcing Services 74175.0 5043.7 858600
Education 6970.0 393.5 70653
>>>
如果想要对单列进行操作,可以在 GroupBy 对象后面的方括号中输入列,然后调用聚合方法。
>>> sectors_and_industry["Revenues"].mean().head(5)
Sector Industry
Aerospace & Defense Aerospace and Defense 15353.400000
Apparel Apparel 7225.521429
Business Services Advertising, marketing 11578.000000
Diversified Outsourcing Services 5298.214286
Education 3485.000000
Name: Revenues, dtype: float64
>>>
总之,GroupBy 对象是用于拆分、组织和聚合 DataFrame 值的最佳数据结构。如果你需要使用多个列来标识存储桶,向 groupby
方法传递一个列名组成列表。