Pandas Tutorial: Data Analysis with Python: Part 2

We covered a lot of ground in Part 1 of our pandas tutorial. We went from the basics of pandas DataFrames to indexing and computations. If you’re still not confident with Pandas, you might want to check out the Dataquest pandas Course.


In this tutorial, we’ll dive into one of the most powerful aspects of pandas – its grouping and aggregation functionality. With this functionality, it’s dead simple to compute group summary statistics, discover patterns, and slice up your data in various ways.


Since Thanksgiving was just last week, we’ll use a dataset on what Americans typically eat for Thanksgiving dinner as we explore the pandas library. You can download the dataset here. It contains 1058 online survey responses collected by FiveThirtyEight. Each survey respondent was asked questions about what they typically eat for Thanksgiving, along with some demographic questions, like their gender, income, and location. This dataset will allow us to discover regional and income-based patterns in what Americans eat for Thanksgiving dinner. As we explore the data and try to find patterns, we’ll be heavily using the grouping and aggregation functionality of pandas.


We’re very into Thanksgiving dinner in America.


Just as a note, we’ll be using Python 3.5 and Jupyter Notebook to do our analysis.


Reading in and summarizing the data

Our first step is to read in the data and do some preliminary exploration. This will help us figure out how we want to approach creating groups and finding patterns.


As you may recall from part one of this tutorial, we can read in the data using the pandas.read_csv function. The data is stored using Latin-1 encoding, so we additionally need to specify the encoding keyword argument. If we don’t, pandas won’t be able to load in the data, and we’ll get an error:


import pandas as pd

data = pd.read_csv("thanksgiving-2015-poll-data.csv", encoding="Latin-1")
data.head()
(a selection of the 65 columns, with some headers shortened)

   RespondentID  Do you celebrate Thanksgiving?  Main dish  How cooked  Cranberry sauce          Where you live  Age      Gender  Household income      US Region
0  4337954960    Yes                             Turkey     Baked       None                     Suburban        18 - 29  Male    $75,000 to $99,999    Middle Atlantic
1  4337951949    Yes                             Turkey     Baked       Other (please specify)   Rural           18 - 29  Female  $50,000 to $74,999    East South Central
2  4337935621    Yes                             Turkey     Roasted     Homemade                 Suburban        18 - 29  Male    $0 to $9,999          Mountain
3  4337933040    Yes                             Turkey     Baked       Homemade                 Urban           30 - 44  Male    $200,000 and up       Pacific
4  4337931983    Yes                             Tofurkey   Baked       Canned                   Urban           30 - 44  Male    $100,000 to $124,999  Pacific

5 rows × 65 columns


As you can see above, the data has 65 columns of mostly categorical data. For example, the first column appears to allow for Yes and No responses only. Let’s verify by using the pandas.Series.unique method to see what unique values are in the Do you celebrate Thanksgiving? column of data:

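For example, a call like the following would produce the output below:

data["Do you celebrate Thanksgiving?"].unique()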


array(['Yes', 'No'], dtype=object)

We can also view all the column names to see all of the survey questions. We’ll truncate the output below to save you from having to scroll:


data.columns[50:]

Index(['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Other (please specify).1',
       'Do you typically pray before or after the Thanksgiving meal?',
       'How far will you travel for Thanksgiving?',
       "Will you watch any of the following programs on Thanksgiving? Please select all that apply. - Macy's Parade",
       'What\'s the age cutoff at your "kids\' table" at Thanksgiving?',
       'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
       'Have you ever attended a "Friendsgiving?"',
       'Will you shop any Black Friday sales on Thanksgiving Day?',
       'Do you work in retail?',
       'Will you employer make you work on Black Friday?',
       'How would you describe where you live?', 'Age', 'What is your gender?',
       'How much total combined money did all members of your HOUSEHOLD earn last year?',
       'US Region'],
      dtype='object')

Using this Thanksgiving survey data, we can answer quite a few interesting questions, like:


  • Do people in Suburban areas eat more Tofurkey than people in Rural areas?
  • Where do people go to Black Friday sales most often?
  • Is there a correlation between praying on Thanksgiving and income?
  • What income groups are most likely to have homemade cranberry sauce?

In order to answer these questions and others, we’ll first need to become familiar with applying, grouping and aggregation in Pandas.


Applying functions to Series in pandas

There are times when we’re working with pandas and want to apply a function to every row or every column in the data. A good example is converting the values in our What is your gender? column to numeric values. We’ll assign 0 to Male, and 1 to Female.


Before we dive into transforming the values, let’s confirm that the values in the column are either Male or Female. We can use the pandas.Series.value_counts method to help us with this. We’ll pass the dropna=False keyword argument to also count missing values:

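For example, we can count the values in the column (including missing ones) like this:

data["What is your gender?"].value_counts(dropna=False)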


Female    544
Male      481
NaN        33
Name: What is your gender?, dtype: int64

As you can see, not all of the values are Male or Female. We’ll preserve any missing values in the final output when we transform our column. Here’s a diagram of the input and outputs we need:


+------------------+                +--------------+
| What is your     |                |    gender    |
| gender?          |                |              |
+------------------+                +--------------+
|       Male       |                |      0       |
+------------------+   transform    +--------------+
|      Female      |   column with  |      1       |
+------------------+   apply        +--------------+
|       NaN        | -------------> |     NaN      |
+------------------+                +--------------+
|       Male       |                |      0       |
+------------------+                +--------------+
|      Female      |                |      1       |
+------------------+                +--------------+


We’ll need to apply a custom function to each value in the What is your gender? column to get the output we want. Here’s a function that will do the transformation we want:


import math

def gender_code(gender_string):
    if isinstance(gender_string, float) and math.isnan(gender_string):
        return gender_string
    return int(gender_string == "Female")

In order to apply this function to each item in the What is your gender? column, we could either write a for loop, and loop across each element in the column, or we could use the pandas.Series.apply method.


This method will take a function as input, then return a new pandas Series that contains the results of applying the function to each item in the Series. We can assign the result back to a column in the data DataFrame, then verify the results using value_counts:

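For example, applying gender_code to the column and storing the result in a new gender column would look something like this:

data["gender"] = data["What is your gender?"].apply(gender_code)
data["gender"].value_counts(dropna=False)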


 1.0    544
 0.0    481
NaN      33
Name: gender, dtype: int64

Applying functions to DataFrames in pandas

We can use the apply method on DataFrames as well as Series. When we use the pandas.DataFrame.apply method, an entire row or column will be passed into the function we specify. By default, apply will work across each column in the DataFrame. If we pass the axis=1 keyword argument, it will work across each row.


In the below example, we check the data type of each column in data using a lambda function. We also call the head method on the result to avoid having too much output:


data.apply(lambda x: x.dtype).head()

RespondentID                                                                             object
Do you celebrate Thanksgiving?                                                           object
What is typically the main dish at your Thanksgiving dinner?                             object
What is typically the main dish at your Thanksgiving dinner? - Other (please specify)    object
How is the main dish typically cooked?                                                   object
dtype: object

Using the apply method to clean up income

We can now use what we know about the apply method to clean up the How much total combined money did all members of your HOUSEHOLD earn last year? column. Cleaning up the income column will allow us to go from string values to numeric values. First, let’s see all the unique values that are in the How much total combined money did all members of your HOUSEHOLD earn last year? column:

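For example, we can list the counts (including missing values) like this:

data["How much total combined money did all members of your HOUSEHOLD earn last year?"].value_counts(dropna=False)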


$25,000 to $49,999      180
Prefer not to answer    136
$50,000 to $74,999      135
$75,000 to $99,999      133
$100,000 to $124,999    111
$200,000 and up          80
$10,000 to $24,999       68
$0 to $9,999             66
$125,000 to $149,999     49
$150,000 to $174,999     40
NaN                      33
$175,000 to $199,999     27
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

Looking at this, there are 4 different patterns for the values in the column:


  • X to Y – an example is $25,000 to $49,999.
    • We can convert this to a numeric value by extracting the numbers and averaging them.
  • NaN
    • We’ll preserve NaN values, and not convert them at all.
  • X and up – an example is $200,000 and up.
    • We can convert this to a numeric value by extracting the number.
  • Prefer not to answer
    • We’ll turn this into an NaN value.

Here is how we want the transformations to work:


+----------------------+                +--------------+
| How much total       |                |    income    |
| combined ...         |                |              |
+----------------------+                +--------------+
| $25,000 to           |                |              |
| $49,999              |                |   37499.5    |
+----------------------+   transform    +--------------+
| Prefer not to        |   column with  |              |
| answer               |   apply        |     NaN      |
+----------------------+ -------------> +--------------+
| NaN                  |                |     NaN      |
+----------------------+                +--------------+
| $200,000 and up      |                |    200000    |
+----------------------+                +--------------+
| $175,000 to          |                |              |
| $199,999             |                |   187499.5   |
+----------------------+                +--------------+


We can write a function that covers all of these cases. In the below function, we:


  • Take a string called value as input.
  • Check to see if value is $200,000 and up, and return 200000 if so.
  • Check if value is Prefer not to answer, and return NaN if so.
  • Check if value is NaN, and return NaN if so.
  • Clean up value by removing any dollar signs or commas.
  • Split the string to extract the incomes, then average them.
import numpy as np

def clean_income(value):
    if value == "$200,000 and up":
        return 200000
    elif value == "Prefer not to answer":
        return np.nan
    elif isinstance(value, float) and math.isnan(value):
        return np.nan
    value = value.replace(",", "").replace("$", "")
    income_low, income_high = value.split(" to ")
    return (int(income_low) + int(income_high)) / 2


After creating the function, we can apply it to the How much total combined money did all members of your HOUSEHOLD earn last year? column:

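For example, applying clean_income and storing the result in a new income column would look something like this:

data["income"] = data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(clean_income)
data["income"].head()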


0     87499.5
1     62499.5
2      4999.5
3    200000.0
4    112499.5
Name: income, dtype: float64

Grouping data with pandas

Now that we’ve covered applying functions, we can move on to grouping data using pandas. When performing data analysis, it’s often useful to explore only a subset of the data. For example, what if we want to compare income between people who tend to eat homemade cranberry sauce for Thanksgiving vs people who eat canned cranberry sauce? First, let’s see what the unique values in the column are:


data["What type of cranberry saucedo you typically have?"].value_counts()

Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
Name: What type of cranberry saucedo you typically have?, dtype: int64

We can now filter data to get two DataFrames that only contain rows where the What type of cranberry saucedo you typically have? is Canned or Homemade, respectively:

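A sketch of that filtering, using boolean indexing on the cranberry sauce column:

homemade = data[data["What type of cranberry saucedo you typically have?"] == "Homemade"]
canned = data[data["What type of cranberry saucedo you typically have?"] == "Canned"]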

Finally, we can use the pandas.Series.mean method to find the average income in homemade and canned:


print(homemade["income"].mean())
print(canned["income"].mean())

94878.1072874
83823.4034091

We get our answer, but it took more lines of code than it should have. What if we now want to compute the average income for people who didn’t have cranberry sauce?


An easier way to find groupwise summary statistics with pandas is to use the pandas.DataFrame.groupby method. This method will split a DataFrame into groups based on a column or set of columns. We’ll then be able to perform computations on each group.


Here’s how splitting data based on the What type of cranberry saucedo you typically have? column would look:


              data                                                       groups
+-----------+------------------+                        +-----------+------------------+
|  income   | What type of     |                        |  income   | What type of     |
|           | cranberry sauce  |                        |           | cranberry sauce  |
+-----------+------------------+                        +-----------+------------------+
|  200000   | Homemade         |   Split up based on    |  200000   | Homemade         |
|  187499.5 | Homemade         |   the value of the     |  187499.5 | Homemade         |
|  4999.5   | Canned           |   "What type of        +-----------+------------------+
|  NaN      | None             |   cranberry sauce"     |  NaN      | None             |
|  200000   | Canned           |   column               +-----------+------------------+
+-----------+------------------+ ---------------------> |  4999.5   | Canned           |
                                                        |  200000   | Canned           |
                                                        +-----------+------------------+


Note how each resulting group only has a single unique value in the What type of cranberry saucedo you typically have? column. One group is created for each unique value in the column we choose to group by.


Let’s create groups from the What type of cranberry saucedo you typically have? column:

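For example, grouping on that column and displaying the resulting object:

grouped = data.groupby("What type of cranberry saucedo you typically have?")
grouped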


<pandas.core.groupby.DataFrameGroupBy object at 0x10a22cc50>

As you can see above, the groupby method returns a DataFrameGroupBy object. We can call the pandas.GroupBy.groups method to see what value for the What type of cranberry saucedo you typically have? column is in each group:


grouped.groups

{'Canned': Int64Index([   4,    6,    8,   11,   12,   15,   18,   19,   26,   27,
             ...
             1040, 1041, 1042, 1044, 1045, 1046, 1047, 1051, 1054, 1057],
            dtype='int64', length=502),
 'Homemade': Int64Index([   2,    3,    5,    7,   13,   14,   16,   20,   21,   23,
             ...
             1016, 1017, 1025, 1027, 1030, 1034, 1048, 1049, 1053, 1056],
            dtype='int64', length=301),
 'None': Int64Index([   0,   17,   24,   29,   34,   36,   40,   47,   49,   51,
             ...
              980,  981,  997, 1015, 1018, 1031, 1037, 1043, 1050, 1055],
            dtype='int64', length=146),
 'Other (please specify)': Int64Index([   1,    9,  154,  216,  221,  233,  249,  265,  301,  336,  380,
              435,  444,  447,  513,  550,  749,  750,  784,  807,  860,  872,
              905, 1000, 1007],
            dtype='int64')}

We can call the pandas.GroupBy.size method to see how many rows are in each group. This is equivalent to the value_counts method on a Series:

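Calling size on the grouped object we created above:

grouped.size()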


What type of cranberry saucedo you typically have?
Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
dtype: int64

We can also use a loop to manually iterate through the groups:


for name, group in grouped:
    print(name)
    print(group.shape)
    print(type(group))

Canned
(502, 67)
<class 'pandas.core.frame.DataFrame'>
Homemade
(301, 67)
<class 'pandas.core.frame.DataFrame'>
None
(146, 67)
<class 'pandas.core.frame.DataFrame'>
Other (please specify)
(25, 67)
<class 'pandas.core.frame.DataFrame'>

As you can see above, each group is a DataFrame, and you can use any normal DataFrame methods on it.


We can also extract a single column from a group. This will allow us to perform further computations just on that specific column:

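For example, selecting the income column from the grouped object:

grouped["income"]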


<pandas.core.groupby.SeriesGroupBy object at 0x1081ef390>

As you can see above, this gives us a SeriesGroupBy object. We can then call the normal methods we can call on a DataFrameGroupBy object:


grouped["income"].size()

What type of cranberry saucedo you typically have?
Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
dtype: int64

Aggregating values in groups

If all we could do was split a DataFrame into groups, it wouldn’t be of much use. The real power of groups is in the computations we can do after creating groups. We do these computations through the pandas.GroupBy.aggregate method, which we can abbreviate as agg. This method allows us to perform the same computation on every group.


For example, we could find the average income for people who served each type of cranberry sauce for Thanksgiving (Canned, Homemade, None, etc).


In the below code, we:


  • Extract just the income column from grouped, so we don’t find the average of every column.
  • Call the agg method with np.mean as input.
    • This will compute the mean for each group, then combine the results from each group.
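Putting those two steps together, the computation would look something like this:

grouped["income"].agg(np.mean)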

What type of cranberry saucedo you typically have?
Canned                    83823.403409
Homemade                  94878.107287
None                      78886.084034
Other (please specify)    86629.978261
Name: income, dtype: float64

If we left out only selecting the income column, here’s what we’d get:


grouped.agg(np.mean)
                                                    RespondentID    gender        income
What type of cranberry saucedo you typically have?
Canned                                                4336699416  0.552846  83823.403409
Homemade                                              4336792040  0.533101  94878.107287
None                                                  4336764989  0.517483  78886.084034
Other (please specify)                                4336763253  0.640000  86629.978261

The above code will find the mean for each group for every column in data. However, most columns are string columns, not integer or float columns, so pandas didn’t process them, since calling np.mean on them raised an error.


Plotting the results of aggregation

We can make a plot using the results of our agg method. This will create a bar chart that shows the average income of each category.


In the below code, we plot the average income for each type of cranberry sauce as a bar chart:

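A minimal sketch of such a plotting cell, assuming matplotlib is installed; the %matplotlib inline magic is only needed when running inside a Jupyter notebook:

# %matplotlib inline  # run this first in a Jupyter notebook to display plots inline
grouped["income"].agg(np.mean).plot(kind="bar")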


<matplotlib.axes._subplots.AxesSubplot at 0x109ebacc0>

Aggregating with multiple columns

We can call groupby with multiple columns as input to get more granular groups. If we use the What type of cranberry saucedo you typically have? and What is typically the main dish at your Thanksgiving dinner? columns as input, we’ll be able to find the average income of people who eat Homemade cranberry sauce and Tofurkey, for example:


grouped = data.groupby(["What type of cranberry saucedo you typically have?", "What is typically the main dish at your Thanksgiving dinner?"])
grouped.agg(np.mean)
Cranberry sauce         Main dish                RespondentID    gender         income
Canned                  Chicken                    4336354418  0.333333   80999.600000
                        Ham/Pork                   4336757434  0.642857   77499.535714
                        I don't know               4335987430  0.000000    4999.500000
                        Other (please specify)     4336682072  1.000000   53213.785714
                        Roast beef                 4336254414  0.571429   25499.500000
                        Tofurkey                   4337156546  0.714286  100713.857143
                        Turkey                     4336705225  0.544444   85242.682045
Homemade                Chicken                    4336539693  0.750000   19999.500000
                        Ham/Pork                   4337252861  0.250000   96874.625000
                        I don't know               4336083561  1.000000            NaN
                        Other (please specify)     4336863306  0.600000   55356.642857
                        Roast beef                 4336173790  0.000000   33749.500000
                        Tofurkey                   4336789676  0.666667   57916.166667
                        Turducken                  4337475308  0.500000  200000.000000
                        Turkey                     4336790802  0.531008   97690.147982
None                    Chicken                    4336150656  0.500000   11249.500000
                        Ham/Pork                   4336679896  0.444444   61249.500000
                        I don't know               4336412261  0.500000   33749.500000
                        Other (please specify)     4336687790  0.600000  119106.678571
                        Roast beef                 4337423740  0.000000  162499.500000
                        Tofurkey                   4336950068  0.500000  112499.500000
                        Turducken                  4336738591  0.000000            NaN
                        Turkey                     4336784218  0.523364   74606.275281
Other (please specify)  Ham/Pork                   4336465104  1.000000   87499.500000
                        Other (please specify)     4337335395  0.000000  124999.666667
                        Tofurkey                   4336121663  1.000000   37499.500000
                        Turkey                     4336724418  0.700000   82916.194444

As you can see above, we get a nice table that shows us the mean of each column for each group. This enables us to find some interesting patterns, such as:


  • People who have Turducken and Homemade cranberry sauce seem to have high household incomes.
  • People who eat Canned cranberry sauce tend to have lower incomes, but those who also have Roast Beef have the lowest incomes.
  • It looks like there’s one person who has Canned cranberry sauce and doesn’t know what type of main dish he’s having.

Aggregating with multiple functions

We can also perform aggregation with multiple functions. This enables us to calculate the mean and standard deviation of a group, for example. In the below code, we find the sum, standard deviation, and mean of each group in the income column:

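One way to compute all three statistics at once is to pass a list of functions to agg:

grouped["income"].agg([np.mean, np.sum, np.std])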

Cranberry sauce  Main dish                         mean         sum           std
Canned           Chicken                   80999.600000    404998.0  75779.481062
                 Ham/Pork                  77499.535714   1084993.5  56645.063944
                 I don't know               4999.500000      4999.5           NaN
                 Other (please specify)    53213.785714    372496.5  29780.946290
                 Roast beef                25499.500000    127497.5  24584.039538
                 Tofurkey                 100713.857143    704997.0  61351.484439
                 Turkey                    85242.682045  34182315.5  55687.436102
Homemade         Chicken                   19999.500000     59998.5  16393.596311
                 Ham/Pork                  96874.625000    387498.5  77308.452805
                 I don't know                       NaN         NaN           NaN

Using apply on groups

One of the limitations of aggregation is that each function has to return a single number. While we can perform computations like finding the mean, we can’t for example, call value_counts to get the exact count of a category. We can do this using the pandas.GroupBy.apply method. This method will apply a function to each group, then combine the results.


In the below code, we’ll apply value_counts to find the number of people who live in each area type (Rural, Suburban, etc) who eat different kinds of main dishes for Thanksgiving:


grouped = data.groupby("How would you describe where you live?")["What is typically the main dish at your Thanksgiving dinner?"]
grouped.apply(lambda x: x.value_counts())

How would you describe where you live?                        
Rural                                   Turkey                    189
                                        Other (please specify)      9
                                        Ham/Pork                    7
                                        I don't know                3
                                        Tofurkey                    3
                                        Turducken                   2
                                        Chicken                     2
                                        Roast beef                  1
Suburban                                Turkey                    449
                                        Ham/Pork                   17
                                        Other (please specify)     13
                                        Tofurkey                    9
                                        Roast beef                  3
                                        Chicken                     3
                                        Turducken                   1
                                        I don't know                1
Urban                                   Turkey                    198
                                        Other (please specify)     13
                                        Tofurkey                    8
                                        Chicken                     7
                                        Roast beef                  6
                                        Ham/Pork                    4
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

The above table shows us that people who live in different types of areas eat different Thanksgiving main dishes at about the same rate.


Further reading

In this tutorial, we learned how to use pandas to group data and calculate results. We learned several techniques for manipulating groups and finding patterns.


Source: https://www.pybloggers.com/2016/12/pandas-tutorial-data-analysis-with-python-part-2/
