Group Summary Statistics

1: Statistics By Group

In the last mission, we computed summary statistics across columns using SQL. In many cases, though, we want to drill down even more, and compute summary statistics per-group. In this mission, we'll explore how to calculate more granular summary statistics. We'll switch back to writing SQL queries directly instead of using Python so we can focus more on the SQL syntax.

We'll be working with a dataset on jobs that's stored in the recent_grads table of jobs.db. Each row represents a single college major, and contains information about post-graduation employment of students who had the major. You can find out more about the datasethere. Here are descriptions of a few of the columns (the whole dataset has 21 columns):

  • Rank - The numerical rank of the major by post-graduation median earnings.
  • Major_code - The numerical code of the major.
  • Major - The description of the major.
  • Major_category - The category of the major.
  • Total - The total number of people who studied the major.
  • Men - The number of men who studied the major.
  • Women - The number of women who studied the major.
  • ShareWomen - The share of women (from 0 to 1) who studied the major.
  • Employed - The number of people who studied the major and were employed post-graduation.

Here are the first few rows and columns in the dataset:

RankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployed
12419PETROLEUM ENGINEERINGEngineering23393620572820.1205641976
22416MINING AND MINERAL ENGINEERINGEngineering7567679770.101852640
32415METALLURGICAL ENGINEERINGEngineering85637251310.153037648
42417NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering12581611231350.107313758
52405CHEMICAL ENGINEERINGEngineering3226028921239110210.34163125694

As we go through this mission, we'll drill down and compute summary statistics by group, and answer questions like:

  • What's the share of women in each major category?
  • What major categories are the most likely to be employed post-graduation?
  • What percentage of people in each major category get a low wage job?

First, let's explore the data.

Instructions

Write a SQL query that displays all of the columns and the first 5 rows of therecent_grads table.


select * from recent_grads limit 5 ;

 

2: The GROUP BY Statement

The GROUP BY SQL statement allows us to compute summary statistics for each group in a dataset. Groups are defined by the unique values in a column or set of columns. The unique values are equivalent to using the DISTINCT statement. To illustrate, we can figure out the total number of people employed in each major category with the following query:

 

SELECT SUM(Employed)

FROM recent_grads

GROUP BY Major_category;

This will give us the count of total people employed for each major category (output is truncated for length):

SUM(Employed)
66943
288114
302797

The above displays aggregate counts of the Employed column for each Major_category. Unfortunately, the above doesn't have any indication of which major category each row refers to. We can fix this by selecting the Major_category column as well:

 

SELECT Major_category, SUM(Employed)

FROM recent_grads

GROUP BY Major_category;

This results in more clear output:

Major_categorySUM(Employed)
Agriculture & Natural Resources66943
Arts288114
Biology & Life Science302797

This works because the GROUP BY statement splits each category into groups, then figures out what values to display. Here's a diagram of how GROUP BY splits the data into groups, using a small sample from the recent_grads table:

Major_categoryEmployedArts2914SELECTEmployed,Major_category,SUM(Employed)Agriculture3149FROMrecent_gradsGROUPBYMajor_category;Arts36165Agriculture1290Group1Group2Major_category="Arts"Major_category="Agriculture"Arts2914Agriculture3149Arts36165Agriculture1290

For each group, the GROUP BY statement queries each column and aggregation function mentioned after the SELECT statement:

SELECTEmployed,Major_category,SUM(Employed)FROMrecent_gradsGROUPBYMajor_category;Group1Major_category="Arts"Arts2914Arts36165EmployedMajor_categorySUM(Employed)36165Arts2914+36165=39075

If a column is selected, the SQL engine will use the last value for that column in the group. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.

For the query in the diagram, we end up with the following result:

EmployedMajor_categorySUM(Employed)
1290Agriculture4439
36165Arts39075

Instructions

  • Use the SELECT statement to select the following columns and aggregates in a query:
    • Major_category
    • AVG(ShareWomen)
  • Use the GROUP BY statement to group the query by theMajor_category column.

select Major_category,AVG(ShareWomen) from recent_grads group by Major_category

3: The AS Statement

You may have noticed that in the last screen, specifying AVG(ShareWomen) caused the column to show up with that name in the results. This can often be unintuitive, and make it hard to work with the results of SQL queries. In order to help with this, when we select columns, we can rename them using the AS statement. Here's an example:

 

SELECT AVG(ShareWomen) AS average_female_share

FROM recent_grads;

This query will result in the following output:

average_female_share
0.5225502029537575

Instructions

  • Write a query that selects the following items, in order, and renames them with AS:
    • SUM(Men) as total_men.
    • SUM(Women) astotal_women

select SUM(Men) as total_men,SUM(Women) as total_women from recent_grads

4: Practice: Using GROUP BY

Now that we understand the GROUP BY statement better, let's practice with it to compute some summary statistics by group in therecent_grads table.

Instructions

  • Find the percentage of graduates that are employed in each major category.
    • Use the SELECT statement to select the following columns and aggregates in a query:
      • Major_category
      • AVG(Employed) / AVG(Total) asshare_employed
    • Use the GROUP BYstatement to group the query by theMajor_category column.

select Major_category,AVG(Employed) / AVG(Total) as share_employed
from recent_grads
group by 
Major_category

 

5: The HAVING Statement

Sometimes, we'll want to select a subset of rows after we perform a GROUP BY query. For instance, in the last screen, we may have only wanted to select rows where share_employed is greater than .8. We can't use the WHERE clause to do this, because share_employedisn't a column in recent_grads. Instead, it's a virtual column that's generated by the GROUP BY statement.

In cases like this, where we want to filter on a generated column, we can use the HAVING statement. Here's an example:

 

SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

Note how we're able to use the same column name in the HAVING statement that we specified using the AS statement. SQL allows us to use custom column names in subsequent statements, including HAVING and WHERE. The above statement will result in the following output:

Major_categoryshare_employed
Agriculture & Natural Resources0.8369862842425075
Arts0.8067482429367457
Business0.8359659576036412
Communications & Journalism0.8422291333949735

Note how only categories where share_employed is greater than .8 are shown above. This is because the HAVING statement filters out the other rows.

Instructions

  • Find all the major categories where the share of graduates with low wage jobs is greater the .1.
    • Use the SELECT statement to select the following columns and aggregates in a query:
      • Major_category
      • AVG(Low_wage_jobs) / AVG(Total) asshare_low_wage
    • Use the GROUP BYstatement to group the query by theMajor_category column.
    • Use the HAVING statement to only select rows whereshare_low_wage is greater than .1

 

 

select Major_category,AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
from recent_grads
group by Major_category
HAVING share_low_wage > .1

 

6: The ROUND Function

In the last screen, the percentages we got as results were very long and hard to read -- they looked like 0.16833085991095678. We can use the SQL ROUND function to round values when we query them.

 

SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women

FROM recent_grads;

The above SQL query will round the ShareWomen column to 2 decimal places, and display the results. Here's a truncated view of the results:

Major_categoryrounded_share_women
Engineering0.12
Engineering0.1

By specifying different values to the ROUND function, such as ROUND(ShareWomen, 3), we can round to different numbers of decimal places.

Instructions

  • Write a SQL query that returns the following columns ofrecent_grads, in the specified order:
    • ShareWomen rounded to 4decimal places.
    • Major_category
  • Limit the results to 10 rows

 


SELECT ROUND(ShareWomen, 4), Major_category FROM recent_grads LIMIT 10;

 

7: Nested Functions

In a previous screen, we used the following query:

 

SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

This displayed very long fractional values for share_employed. We can update this with the ROUND function to round the results to 3decimal places:

 

SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

This will result in the following:

Major_categoryshare_employed
Agriculture & Natural Resources0.837
Arts0.807

Instructions

  • Use the SELECT statement to select the following columns and aggregates in a query:
    • Major_category
    • AVG(College_jobs) / AVG(Total) asshare_degree_jobs
      • Use the ROUNDfunction to roundshare_degree_jobsto 3 decimal places.
  • Group the query by theMajor_category column.
  • Only select rows whereshare_degree_jobs is less than.3.

select Major_category,round(AVG(College_jobs) / AVG(Total),3) as share_degree_jobs
from recent_grads
group by Major_category
HAVING share_degree_jobs < .3

 

8: Next Steps

In this mission, we covered the GROUP BY and HAVING statements. With these statements, we can calculate powerful summary statistics in SQL quickly. In the next few missions, we'll learn more about working with SQL tables, and inserting and modifying data.

转载于:https://my.oschina.net/Bettyty/blog/747547

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值