We covered a lot of ground in Part 1 of our pandas tutorial. We went from the basics of pandas DataFrames to indexing and computations. If you’re still not confident with pandas, you might want to check out the Dataquest pandas course.
In this tutorial, we’ll dive into one of the most powerful aspects of pandas – its grouping and aggregation functionality. With this functionality, it’s dead simple to compute group summary statistics, discover patterns, and slice up your data in various ways.
Since Thanksgiving was just last week, we’ll use a dataset on what Americans typically eat for Thanksgiving dinner as we explore the pandas library. You can download the dataset here. It contains 1058 online survey responses collected by FiveThirtyEight. Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner. As we explore the data and try to find patterns, we’ll be heavily using the grouping and aggregation functionality of pandas.
We’re very into Thanksgiving dinner in America.
Just as a note, we’ll be using Python 3.5 and Jupyter Notebook to do our analysis.
Reading in and summarizing the data
Our first step is to read in the data and do some preliminary exploration. This will help us figure out how we want to approach creating groups and finding patterns.
As you may recall from part one of this tutorial, we can read in the data using the pandas.read_csv function. The data is stored using Latin-1 encoding, so we additionally need to specify the encoding keyword argument. If we don’t, pandas won’t be able to load in the data, and we’ll get an error:
import pandas as pd

data = pd.read_csv("thanksgiving-2015-poll-data.csv", encoding="Latin-1")
data.head()
 | RespondentID | Do you celebrate Thanksgiving? | What is typically the main dish at your Thanksgiving dinner? | What is typically the main dish at your Thanksgiving dinner? – Other (please specify) | How is the main dish typically cooked? | How is the main dish typically cooked? – Other (please specify) | What kind of stuffing/dressing do you typically have? | What kind of stuffing/dressing do you typically have? – Other (please specify) | What type of cranberry saucedo you typically have? | What type of cranberry saucedo you typically have? – Other (please specify) | … | Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a “Friendsgiving?” | Will you shop any Black Friday sales on Thanksgiving Day? | Do you work in retail? | Will you employer make you work on Black Friday? | How would you describe where you live? | Age | What is your gender? | How much total combined money did all members of your HOUSEHOLD earn last year? | US Region
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 4337954960 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | None | NaN | … | Yes | No | No | No | NaN | Suburban | 18 – 29 | Male | $75,000 to $99,999 | Middle Atlantic
1 | 4337951949 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Other (please specify) | Homemade cranberry gelatin ring | … | No | No | Yes | No | NaN | Rural | 18 – 29 | Female | $50,000 to $74,999 | East South Central
2 | 4337935621 | Yes | Turkey | NaN | Roasted | NaN | Rice-based | NaN | Homemade | NaN | … | Yes | Yes | Yes | No | NaN | Suburban | 18 – 29 | Male | $0 to $9,999 | Mountain
3 | 4337933040 | Yes | Turkey | NaN | Baked | NaN | Bread-based | NaN | Homemade | NaN | … | Yes | No | No | No | NaN | Urban | 30 – 44 | Male | $200,000 and up | Pacific
4 | 4337931983 | Yes | Tofurkey | NaN | Baked | NaN | Bread-based | NaN | Canned | NaN | … | Yes | No | No | No | NaN | Urban | 30 – 44 | Male | $100,000 to $124,999 | Pacific

5 rows × 65 columns
As you can see above, the data has 65 columns of mostly categorical data. For example, the first column appears to allow for Yes and No responses only. Let’s verify by using the pandas.Series.unique method to see what unique values are in the Do you celebrate Thanksgiving? column of data:
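On the real survey data, this check is a one-liner, presumably data["Do you celebrate Thanksgiving?"].unique(). A self-contained sketch of the same call, using a small toy frame standing in for the survey:

```python
import pandas as pd

# Toy stand-in for the survey data (the real frame is loaded from
# thanksgiving-2015-poll-data.csv).
toy = pd.DataFrame({"Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"]})

# unique() returns each distinct value once, in order of first appearance.
print(toy["Do you celebrate Thanksgiving?"].unique())
```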
array(['Yes', 'No'], dtype=object)
We can also view all the column names to see all of the survey questions. We’ll truncate the output below to save you from having to scroll:
data.columns[50:]
Index(['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Other (please specify).1',
'Do you typically pray before or after the Thanksgiving meal?',
'How far will you travel for Thanksgiving?',
'Will you watch any of the following programs on Thanksgiving? Please select all that apply. - Macy's Parade',
'What's the age cutoff at your "kids' table" at Thanksgiving?',
'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
'Have you ever attended a "Friendsgiving?"',
'Will you shop any Black Friday sales on Thanksgiving Day?',
'Do you work in retail?',
'Will you employer make you work on Black Friday?',
'How would you describe where you live?', 'Age', 'What is your gender?',
'How much total combined money did all members of your HOUSEHOLD earn last year?',
'US Region'],
dtype='object')
Using this Thanksgiving survey data, we can answer quite a few interesting questions, like:
- Do people in Suburban areas eat more Tofurkey than people in Rural areas?
- Where do people go to Black Friday sales most often?
- Is there a correlation between praying on Thanksgiving and income?
- What income groups are most likely to have homemade cranberry sauce?
In order to answer these questions and others, we’ll first need to become familiar with applying, grouping, and aggregation in pandas.
Applying functions to Series in pandas
There are times when we’re using pandas that we want to apply a function to every row or every column in the data. A good example is getting from the values in our What is your gender? column to numeric values. We’ll assign 0 to Male, and 1 to Female.
Before we dive into transforming the values, let’s confirm that the values in the column are either Male or Female. We can use the pandas.Series.value_counts method to help us with this. We’ll pass the dropna=False keyword argument to also count missing values:
Female 544
Male 481
NaN 33
Name: What is your gender?, dtype: int64
As you can see, not all of the values are Male or Female. We’ll preserve any missing values in the final output when we transform our column. Here’s a diagram of the input and outputs we need:
+-----------------+                  +--------------+
|  What is your   |                  |    gender    |
|    gender?      |                  |              |
+-----------------+                  +--------------+
|                 |                  |              |
|      Male       |                  |      0       |
|                 |    transform     |              |
+-----------------+   column with    +--------------+
|                 |      apply       |              |
|     Female      |  +------------>  |      1       |
|                 |                  |              |
+-----------------+                  +--------------+
|                 |                  |              |
|       NaN       |                  |     NaN      |
|                 |                  |              |
+-----------------+                  +--------------+
|                 |                  |              |
|      Male       |                  |      0       |
|                 |                  |              |
+-----------------+                  +--------------+
|                 |                  |              |
|     Female      |                  |      1       |
|                 |                  |              |
+-----------------+                  +--------------+
We’ll need to apply a custom function to each value in the What is your gender? column to get the output we want. Here’s a function that will do the transformation we want:
import math

def gender_code(gender_string):
    if isinstance(gender_string, float) and math.isnan(gender_string):
        return gender_string
    return int(gender_string == "Female")
In order to apply this function to each item in the What is your gender? column, we could either write a for loop, and loop across each element in the column, or we could use the pandas.Series.apply method.
This method will take a function as input, then return a new pandas Series that contains the results of applying the function to each item in the Series. We can assign the result back to a column in the data DataFrame, then verify the results using value_counts:
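On the survey data, the assignment is presumably data["gender"] = data["What is your gender?"].apply(gender_code). A self-contained sketch of the same pattern on toy data:

```python
import math
import pandas as pd

def gender_code(gender_string):
    # Missing values arrive as float NaN; pass them through unchanged.
    if isinstance(gender_string, float) and math.isnan(gender_string):
        return gender_string
    # Male -> 0, Female -> 1
    return int(gender_string == "Female")

toy = pd.DataFrame({"What is your gender?": ["Male", "Female", float("nan"), "Female"]})
toy["gender"] = toy["What is your gender?"].apply(gender_code)
print(toy["gender"].value_counts(dropna=False))
```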
1.0 544
0.0 481
NaN 33
Name: gender, dtype: int64
Applying functions to DataFrames in pandas
We can use the apply method on DataFrames as well as Series. When we use the pandas.DataFrame.apply method, an entire row or column will be passed into the function we specify. By default, apply will work across each column in the DataFrame. If we pass the axis=1 keyword argument, it will work across each row.
In the below example, we check the data type of each column in data using a lambda function. We also call the head method on the result to avoid having too much output:
data.apply(lambda x: x.dtype).head()
RespondentID object
Do you celebrate Thanksgiving? object
What is typically the main dish at your Thanksgiving dinner? object
What is typically the main dish at your Thanksgiving dinner? - Other (please specify) object
How is the main dish typically cooked? object
dtype: object
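To see the axis difference concretely, here is a minimal sketch on a toy frame (not the survey data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Default axis=0: the lambda receives each *column* as a Series.
col_sums = df.apply(lambda col: col.sum())

# axis=1: the lambda receives each *row* as a Series instead.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```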
Using the apply method to clean up income
We can now use what we know about the apply method to clean up the How much total combined money did all members of your HOUSEHOLD earn last year? column. Cleaning up the income column will allow us to go from string values to numeric values. First, let’s see all the unique values that are in the How much total combined money did all members of your HOUSEHOLD earn last year? column:
$25,000 to $49,999 180
Prefer not to answer 136
$50,000 to $74,999 135
$75,000 to $99,999 133
$100,000 to $124,999 111
$200,000 and up 80
$10,000 to $24,999 68
$0 to $9,999 66
$125,000 to $149,999 49
$150,000 to $174,999 40
NaN 33
$175,000 to $199,999 27
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64
Looking at this, there are 4 different patterns for the values in the column:

- X to Y – an example is $25,000 to $49,999. We can convert this to a numeric value by extracting the numbers and averaging them.
- NaN – we’ll preserve NaN values, and not convert them at all.
- X and up – an example is $200,000 and up. We can convert this to a numeric value by extracting the number.
- Prefer not to answer – we’ll turn this into an NaN value.
Here is how we want the transformations to work:
+-----------------+                  +--------------+
| How much total  |                  |    income    |
| combined ...    |                  |              |
+-----------------+                  +--------------+
|     $25,000     |                  |              |
|       to        |                  |   37499.5    |
|     $49,999     |    transform     |              |
+-----------------+   column with    +--------------+
|     Prefer      |      apply       |              |
|     not to      |  +------------>  |     NaN      |
|     answer      |                  |              |
+-----------------+                  +--------------+
|                 |                  |              |
|       NaN       |                  |     NaN      |
|                 |                  |              |
+-----------------+                  +--------------+
|    $200,000     |                  |              |
|     and up      |                  |    200000    |
|                 |                  |              |
+-----------------+                  +--------------+
|    $175,000     |                  |              |
|       to        |                  |   187499.5   |
|    $199,999     |                  |              |
+-----------------+                  +--------------+
We can write a function that covers all of these cases. In the below function, we:
- Take a string called value as input.
- Check to see if value is $200,000 and up, and return 200000 if so.
- Check if value is Prefer not to answer, and return NaN if so.
- Check if value is NaN, and return NaN if so.
- Clean up value by removing any dollar signs or commas.
- Split the string to extract the incomes, then average them.
import numpy as np

def clean_income(value):
    if value == "$200,000 and up":
        return 200000
    elif value == "Prefer not to answer":
        return np.nan
    elif isinstance(value, float) and math.isnan(value):
        return np.nan
    value = value.replace(",", "").replace("$", "")
    income_high, income_low = value.split(" to ")
    return (int(income_high) + int(income_low)) / 2
After creating the function, we can apply it to the How much total combined money did all members of your HOUSEHOLD earn last year? column:
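The apply call here is presumably data["income"] = data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(clean_income). A self-contained sketch on a few sample values:

```python
import math
import numpy as np
import pandas as pd

def clean_income(value):
    if value == "$200,000 and up":
        return 200000
    elif value == "Prefer not to answer":
        return np.nan
    elif isinstance(value, float) and math.isnan(value):
        return np.nan
    # "$75,000 to $99,999" -> average of the two endpoints
    value = value.replace(",", "").replace("$", "")
    income_high, income_low = value.split(" to ")
    return (int(income_high) + int(income_low)) / 2

raw = pd.Series(["$75,000 to $99,999", "Prefer not to answer", "$200,000 and up"])
income = raw.apply(clean_income)
print(income)
```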
0 87499.5
1 62499.5
2 4999.5
3 200000.0
4 112499.5
Name: income, dtype: float64
Grouping data with pandas
Now that we’ve covered applying functions, we can move on to grouping data using pandas. When performing data analysis, it’s often useful to explore only a subset of the data. For example, what if we want to compare income between people who tend to eat homemade cranberry sauce for Thanksgiving vs people who eat canned cranberry sauce? First, let’s see what the unique values in the column are:
data["What type of cranberry saucedo you typically have?"].value_counts()
Canned 502
Homemade 301
None 146
Other (please specify) 25
Name: What type of cranberry saucedo you typically have?, dtype: int64
We can now filter data to get two DataFrames that only contain rows where the What type of cranberry saucedo you typically have? is Canned or Homemade, respectively:
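The filtering is presumably done with boolean masks; a self-contained sketch, with toy standing in for the survey data:

```python
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({
    col: ["Canned", "Homemade", "None", "Canned"],
    "income": [50000.0, 90000.0, 70000.0, 60000.0],
})

# A boolean mask keeps only the rows matching each sauce type.
homemade = toy[toy[col] == "Homemade"]
canned = toy[toy[col] == "Canned"]
print(len(homemade), len(canned))
```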
Finally, we can use the pandas.Series.mean method to find the average income in homemade and canned:
print(homemade["income"].mean())
print(canned["income"].mean())
94878.1072874
83823.4034091
We get our answer, but it took more lines of code than it should have. What if we now want to compute the average income for people who didn’t have cranberry sauce?
An easier way to find groupwise summary statistics with pandas is to use the pandas.DataFrame.groupby method. This method will split a DataFrame into groups based on a column or set of columns. We’ll then be able to perform computations on each group.
Here’s how splitting data based on the What type of cranberry saucedo you typically have? column would look:
              data                                            groups
+----------+------------------+                    +----------+------------------+
|  income  | What type of     |                    |  200000  | Homemade         |
|          | cranberry sauce  |                    |          |                  |
+----------+------------------+    Split up        | 187499.5 | Homemade         |
|  200000  | Homemade         |    based on the    |          |                  |
+----------+------------------+    value of the    +----------+------------------+
|  4999.5  | Canned           |    "What type of
+----------+------------------+    cranberry       +----------+------------------+
| 187499.5 | Homemade         |    sauce" column   |   NaN    | None             |
+----------+------------------+                    +----------+------------------+
|   NaN    | None             |  +------------->
+----------+------------------+                    +----------+------------------+
|  200000  | Canned           |                    |  4999.5  | Canned           |
+----------+------------------+                    |          |                  |
                                                   |  200000  | Canned           |
                                                   +----------+------------------+
Note how each resulting group only has a single unique value in the What type of cranberry saucedo you typically have? column. One group is created for each unique value in the column we choose to group by.
Let’s create groups from the What type of cranberry saucedo you typically have? column:
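The cell creating the groups is presumably grouped = data.groupby("What type of cranberry saucedo you typically have?"). Sketched on toy data:

```python
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({col: ["Canned", "Homemade", "Canned"], "income": [1.0, 2.0, 3.0]})

# groupby computes nothing yet; it returns a lazy DataFrameGroupBy object.
grouped = toy.groupby(col)
print(grouped)
```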
<pandas.core.groupby.DataFrameGroupBy object at 0x10a22cc50>
As you can see above, the groupby method returns a DataFrameGroupBy object. We can look at pandas.GroupBy.groups to see which rows of data fall into each group of the What type of cranberry saucedo you typically have? column:
grouped.groups
{'Canned': Int64Index([ 4, 6, 8, 11, 12, 15, 18, 19, 26, 27,
...
1040, 1041, 1042, 1044, 1045, 1046, 1047, 1051, 1054, 1057],
dtype='int64', length=502),
'Homemade': Int64Index([ 2, 3, 5, 7, 13, 14, 16, 20, 21, 23,
...
1016, 1017, 1025, 1027, 1030, 1034, 1048, 1049, 1053, 1056],
dtype='int64', length=301),
'None': Int64Index([ 0, 17, 24, 29, 34, 36, 40, 47, 49, 51,
...
980, 981, 997, 1015, 1018, 1031, 1037, 1043, 1050, 1055],
dtype='int64', length=146),
'Other (please specify)': Int64Index([ 1, 9, 154, 216, 221, 233, 249, 265, 301, 336, 380,
435, 444, 447, 513, 550, 749, 750, 784, 807, 860, 872,
905, 1000, 1007],
dtype='int64')}
We can call the pandas.GroupBy.size method to see how many rows are in each group. This is equivalent to the value_counts method on a Series:
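The call here is presumably grouped.size(); continuing the toy sketch:

```python
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({col: ["Canned", "Canned", "Homemade", "None"],
                    "income": [1.0, 2.0, 3.0, 4.0]})

# size() counts the rows in each group, like value_counts on the column.
sizes = toy.groupby(col).size()
print(sizes)
```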
What type of cranberry saucedo you typically have?
Canned 502
Homemade 301
None 146
Other (please specify) 25
dtype: int64
We can also use a loop to manually iterate through the groups:
for name, group in grouped:
    print(name)
    print(group.shape)
    print(type(group))
Canned
(502, 67)
<class 'pandas.core.frame.DataFrame'>
Homemade
(301, 67)
<class 'pandas.core.frame.DataFrame'>
None
(146, 67)
<class 'pandas.core.frame.DataFrame'>
Other (please specify)
(25, 67)
<class 'pandas.core.frame.DataFrame'>
As you can see above, each group is a DataFrame, and you can use any normal DataFrame methods on it.
We can also extract a single column from a group. This will allow us to perform further computations just on that specific column:
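The extraction step is presumably grouped["income"]; a toy sketch:

```python
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({col: ["Canned", "Homemade", "Canned"], "income": [1.0, 2.0, 3.0]})
grouped = toy.groupby(col)

# Indexing a DataFrameGroupBy by column name yields a SeriesGroupBy.
income_groups = grouped["income"]
print(income_groups)
```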
<pandas.core.groupby.SeriesGroupBy object at 0x1081ef390>
As you can see above, this gives us a SeriesGroupBy object. We can then call the normal methods we can call on a DataFrameGroupBy object:
grouped["income"].size()
What type of cranberry saucedo you typically have?
Canned 502
Homemade 301
None 146
Other (please specify) 25
dtype: int64
Aggregating values in groups
If all we could do was split a DataFrame into groups, it wouldn’t be of much use. The real power of groups is in the computations we can do after creating groups. We do these computations through the pandas.GroupBy.aggregate method, which we can abbreviate as agg. This method allows us to perform the same computation on every group.
For example, we could find the average income for people who served each type of cranberry sauce for Thanksgiving (Canned, Homemade, None, etc).
In the below code, we:
- Extract just the income column from grouped, so we don’t find the average of every column.
- Call the agg method with np.mean as input. This will compute the mean for each group, then combine the results from each group.
What type of cranberry saucedo you typically have?
Canned 83823.403409
Homemade 94878.107287
None 78886.084034
Other (please specify) 86629.978261
Name: income, dtype: float64
If we had left out selecting the income column first, here’s what we’d get:
grouped.agg(np.mean)
 | RespondentID | gender | income
---|---|---|---
What type of cranberry saucedo you typically have? | | |
Canned | 4336699416 | 0.552846 | 83823.403409
Homemade | 4336792040 | 0.533101 | 94878.107287
None | 4336764989 | 0.517483 | 78886.084034
Other (please specify) | 4336763253 | 0.640000 | 86629.978261
The above code will find the mean for each group for every column in data. However, most columns are string columns, not integer or float columns, so pandas didn’t process them, since calling np.mean on them raised an error.
Plotting the results of aggregation
We can make a plot using the results of our agg method. This will create a bar chart that shows the average income of each category.
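The plot presumably uses pandas’ matplotlib integration, roughly like this sketch on toy data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({col: ["Canned", "Homemade", "Canned"],
                    "income": [50000.0, 90000.0, 70000.0]})

# Series.plot(kind="bar") draws one bar per group mean.
ax = toy.groupby(col)["income"].agg("mean").plot(kind="bar")
plt.tight_layout()
```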
<matplotlib.axes._subplots.AxesSubplot at 0x109ebacc0>
Aggregating with multiple columns
We can call groupby with multiple columns as input to get more granular groups. If we use the What type of cranberry saucedo you typically have? and What is typically the main dish at your Thanksgiving dinner? columns as input, we’ll be able to find the average income of people who eat Homemade cranberry sauce and Tofurkey, for example:
grouped = data.groupby(["What type of cranberry saucedo you typically have?", "What is typically the main dish at your Thanksgiving dinner?"])
grouped.agg(np.mean)
 | | RespondentID | gender | income
---|---|---|---|---
What type of cranberry saucedo you typically have? | What is typically the main dish at your Thanksgiving dinner? | | |
Canned | Chicken | 4336354418 | 0.333333 | 80999.600000
 | Ham/Pork | 4336757434 | 0.642857 | 77499.535714
 | I don’t know | 4335987430 | 0.000000 | 4999.500000
 | Other (please specify) | 4336682072 | 1.000000 | 53213.785714
 | Roast beef | 4336254414 | 0.571429 | 25499.500000
 | Tofurkey | 4337156546 | 0.714286 | 100713.857143
 | Turkey | 4336705225 | 0.544444 | 85242.682045
Homemade | Chicken | 4336539693 | 0.750000 | 19999.500000
 | Ham/Pork | 4337252861 | 0.250000 | 96874.625000
 | I don’t know | 4336083561 | 1.000000 | NaN
 | Other (please specify) | 4336863306 | 0.600000 | 55356.642857
 | Roast beef | 4336173790 | 0.000000 | 33749.500000
 | Tofurkey | 4336789676 | 0.666667 | 57916.166667
 | Turducken | 4337475308 | 0.500000 | 200000.000000
 | Turkey | 4336790802 | 0.531008 | 97690.147982
None | Chicken | 4336150656 | 0.500000 | 11249.500000
 | Ham/Pork | 4336679896 | 0.444444 | 61249.500000
 | I don’t know | 4336412261 | 0.500000 | 33749.500000
 | Other (please specify) | 4336687790 | 0.600000 | 119106.678571
 | Roast beef | 4337423740 | 0.000000 | 162499.500000
 | Tofurkey | 4336950068 | 0.500000 | 112499.500000
 | Turducken | 4336738591 | 0.000000 | NaN
 | Turkey | 4336784218 | 0.523364 | 74606.275281
Other (please specify) | Ham/Pork | 4336465104 | 1.000000 | 87499.500000
 | Other (please specify) | 4337335395 | 0.000000 | 124999.666667
 | Tofurkey | 4336121663 | 1.000000 | 37499.500000
 | Turkey | 4336724418 | 0.700000 | 82916.194444
As you can see above, we get a nice table that shows us the mean of each column for each group. This enables us to find some interesting patterns, such as:
- People who have Turducken and Homemade cranberry sauce seem to have high household incomes.
- People who eat Canned cranberry sauce tend to have lower incomes, but those who also have Roast Beef have the lowest incomes.
- It looks like there’s one person who has Canned cranberry sauce and doesn’t know what type of main dish he’s having.
Aggregating with multiple functions
We can also perform aggregation with multiple functions. This enables us to calculate the mean and standard deviation of a group, for example. In the below code, we find the mean, sum, and standard deviation of each group in the income column:
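The call producing the table below is presumably grouped["income"].agg([np.mean, np.sum, np.std]). A toy sketch using the string aliases:

```python
import pandas as pd

col = "What type of cranberry saucedo you typically have?"
toy = pd.DataFrame({col: ["Canned", "Canned", "Homemade"], "income": [1.0, 3.0, 5.0]})

# Passing a list of aggregations produces one column per function.
stats = toy.groupby(col)["income"].agg(["mean", "sum", "std"])
print(stats)
```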
 | | mean | sum | std
---|---|---|---|---
What type of cranberry saucedo you typically have? | What is typically the main dish at your Thanksgiving dinner? | | |
Canned | Chicken | 80999.600000 | 404998.0 | 75779.481062
 | Ham/Pork | 77499.535714 | 1084993.5 | 56645.063944
 | I don’t know | 4999.500000 | 4999.5 | NaN
 | Other (please specify) | 53213.785714 | 372496.5 | 29780.946290
 | Roast beef | 25499.500000 | 127497.5 | 24584.039538
 | Tofurkey | 100713.857143 | 704997.0 | 61351.484439
 | Turkey | 85242.682045 | 34182315.5 | 55687.436102
Homemade | Chicken | 19999.500000 | 59998.5 | 16393.596311
 | Ham/Pork | 96874.625000 | 387498.5 | 77308.452805
 | I don’t know | NaN | NaN | NaN
Using apply on groups
One of the limitations of aggregation is that each function has to return a single number. While we can perform computations like finding the mean, we can’t, for example, call value_counts to get the exact count of a category. We can do this using the pandas.GroupBy.apply method. This method will apply a function to each group, then combine the results.
In the below code, we’ll apply value_counts to find the number of people who live in each area type (Rural, Suburban, etc) who eat different kinds of main dishes for Thanksgiving:
grouped = data.groupby("How would you describe where you live?")["What is typically the main dish at your Thanksgiving dinner?"]
grouped.apply(lambda x: x.value_counts())
How would you describe where you live?
Rural Turkey 189
Other (please specify) 9
Ham/Pork 7
I don't know 3
Tofurkey 3
Turducken 2
Chicken 2
Roast beef 1
Suburban Turkey 449
Ham/Pork 17
Other (please specify) 13
Tofurkey 9
Roast beef 3
Chicken 3
Turducken 1
I don't know 1
Urban Turkey 198
Other (please specify) 13
Tofurkey 8
Chicken 7
Roast beef 6
Ham/Pork 4
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64
The above table shows us that people who live in different types of areas eat different Thanksgiving main dishes at about the same rate.
Further reading
In this tutorial, we learned how to use pandas to group data, and calculate results. We learned several techniques for manipulating groups and finding patterns.
Source: https://www.pybloggers.com/2016/12/pandas-tutorial-data-analysis-with-python-part-2/