1: Statistics By Group
In the last mission, we computed summary statistics across columns using SQL. In many cases, though, we want to drill down even more, and compute summary statistics per-group. In this mission, we'll explore how to calculate more granular summary statistics. We'll switch back to writing SQL queries directly instead of using Python so we can focus more on the SQL syntax.
We'll be working with a dataset on jobs that's stored in the recent_grads
table of jobs.db
. Each row represents a single college major, and contains information about post-graduation employment of students who had the major. You can find out more about the datasethere. Here are descriptions of a few of the columns (the whole dataset has 21
columns):
Rank
- The numerical rank of the major by post-graduation median earnings.Major_code
- The numerical code of the major.Major
- The description of the major.Major_category
- The category of the major.Total
- The total number of people who studied the major.Men
- The number of men who studied the major.Women
- The number of women who studied the major.ShareWomen
- The share of women (from0
to1
) who studied the major.Employed
- The number of people who studied the major and were employed post-graduation.
Here are the first few rows and columns in the dataset:
Rank | Major_code | Major | Major_category | Total | Sample_size | Men | Women | ShareWomen | Employed |
---|---|---|---|---|---|---|---|---|---|
1 | 2419 | PETROLEUM ENGINEERING | Engineering | 2339 | 36 | 2057 | 282 | 0.120564 | 1976 |
2 | 2416 | MINING AND MINERAL ENGINEERING | Engineering | 756 | 7 | 679 | 77 | 0.101852 | 640 |
3 | 2415 | METALLURGICAL ENGINEERING | Engineering | 856 | 3 | 725 | 131 | 0.153037 | 648 |
4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering | 1258 | 16 | 1123 | 135 | 0.107313 | 758 |
5 | 2405 | CHEMICAL ENGINEERING | Engineering | 32260 | 289 | 21239 | 11021 | 0.341631 | 25694 |
As we go through this mission, we'll drill down and compute summary statistics by group, and answer questions like:
- What's the share of women in each major category?
- What major categories are the most likely to be employed post-graduation?
- What percentage of people in each major category get a low wage job?
First, let's explore the data.
Instructions
Write a SQL query that displays all of the columns and the first 5
rows of therecent_grads
table.
select * from recent_grads limit 5 ;
2: The GROUP BY Statement
The GROUP BY
SQL statement allows us to compute summary statistics for each group in a dataset. Groups are defined by the unique values in a column or set of columns. The unique values are equivalent to using the DISTINCT
statement. To illustrate, we can figure out the total number of people employed in each major category with the following query:
SELECT SUM(Employed)
FROM recent_grads
GROUP BY Major_category;
This will give us the count of total people employed for each major category (output is truncated for length):
SUM(Employed) |
---|
66943 |
288114 |
302797 |
The above displays aggregate counts of the Employed
column for each Major_category
. Unfortunately, the above doesn't have any indication of which major category each row refers to. We can fix this by selecting the Major_category
column as well:
SELECT Major_category, SUM(Employed)
FROM recent_grads
GROUP BY Major_category;
This results in more clear output:
Major_category | SUM(Employed) |
---|---|
Agriculture & Natural Resources | 66943 |
Arts | 288114 |
Biology & Life Science | 302797 |
This works because the GROUP BY
statement splits each category into groups, then figures out what values to display. Here's a diagram of how GROUP BY
splits the data into groups, using a small sample from the recent_grads
table:
Major_categoryEmployedArts2914SELECTEmployed,Major_category,SUM(Employed)Agriculture3149FROMrecent_gradsGROUPBYMajor_category;Arts36165Agriculture1290Group1Group2Major_category="Arts"Major_category="Agriculture"Arts2914Agriculture3149Arts36165Agriculture1290
For each group, the GROUP BY
statement queries each column and aggregation function mentioned after the SELECT
statement:
SELECTEmployed,Major_category,SUM(Employed)FROMrecent_gradsGROUPBYMajor_category;Group1Major_category="Arts"Arts2914Arts36165EmployedMajor_categorySUM(Employed)36165Arts2914+36165=39075
If a column is selected, the SQL engine will use the last value for that column in the group. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.
For the query in the diagram, we end up with the following result:
Employed | Major_category | SUM(Employed) |
---|---|---|
1290 | Agriculture | 4439 |
36165 | Arts | 39075 |
Instructions
- Use the
SELECT
statement to select the following columns and aggregates in a query:Major_category
AVG(ShareWomen)
- Use the
GROUP BY
statement to group the query by theMajor_category
column.
select Major_category,AVG(ShareWomen) from recent_grads group by Major_category
3: The AS Statement
You may have noticed that in the last screen, specifying AVG(ShareWomen)
caused the column to show up with that name in the results. This can often be unintuitive, and make it hard to work with the results of SQL queries. In order to help with this, when we select columns, we can rename them using the AS
statement. Here's an example:
SELECT AVG(ShareWomen) AS average_female_share
FROM recent_grads;
This query will result in the following output:
average_female_share |
---|
0.5225502029537575 |
Instructions
- Write a query that selects the following items, in order, and renames them with
AS
:SUM(Men)
astotal_men
.SUM(Women)
astotal_women
select SUM(Men) as total_men,SUM(Women) as total_women from recent_grads
4: Practice: Using GROUP BY
Now that we understand the GROUP BY
statement better, let's practice with it to compute some summary statistics by group in therecent_grads
table.
Instructions
- Find the percentage of graduates that are employed in each major category.
- Use the
SELECT
statement to select the following columns and aggregates in a query:Major_category
AVG(Employed) / AVG(Total)
asshare_employed
- Use the
GROUP BY
statement to group the query by theMajor_category
column.
- Use the
select Major_category,AVG(Employed) / AVG(Total) as share_employed
from recent_grads
group by
Major_category
5: The HAVING Statement
Sometimes, we'll want to select a subset of rows after we perform a GROUP BY
query. For instance, in the last screen, we may have only wanted to select rows where share_employed
is greater than .8
. We can't use the WHERE
clause to do this, because share_employed
isn't a column in recent_grads
. Instead, it's a virtual column that's generated by the GROUP BY
statement.
In cases like this, where we want to filter on a generated column, we can use the HAVING
statement. Here's an example:
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
FROM recent_grads
GROUP BY Major_category
HAVING share_employed > .8;
Note how we're able to use the same column name in the HAVING
statement that we specified using the AS
statement. SQL allows us to use custom column names in subsequent statements, including HAVING
and WHERE
. The above statement will result in the following output:
Major_category | share_employed |
---|---|
Agriculture & Natural Resources | 0.8369862842425075 |
Arts | 0.8067482429367457 |
Business | 0.8359659576036412 |
Communications & Journalism | 0.8422291333949735 |
Note how only categories where share_employed
is greater than .8
are shown above. This is because the HAVING
statement filters out the other rows.
Instructions
- Find all the major categories where the share of graduates with low wage jobs is greater the
.1
.- Use the
SELECT
statement to select the following columns and aggregates in a query:Major_category
AVG(Low_wage_jobs) / AVG(Total)
asshare_low_wage
- Use the
GROUP BY
statement to group the query by theMajor_category
column. - Use the
HAVING
statement to only select rows whereshare_low_wage
is greater than.1
- Use the
select Major_category,AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
from recent_grads
group by Major_category
HAVING share_low_wage > .1
6: The ROUND Function
In the last screen, the percentages we got as results were very long and hard to read -- they looked like 0.16833085991095678
. We can use the SQL ROUND
function to round values when we query them.
SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women
FROM recent_grads;
The above SQL query will round the ShareWomen
column to 2
decimal places, and display the results. Here's a truncated view of the results:
Major_category | rounded_share_women |
---|---|
Engineering | 0.12 |
Engineering | 0.1 |
By specifying different values to the ROUND
function, such as ROUND(ShareWomen, 3)
, we can round to different numbers of decimal places.
Instructions
- Write a SQL query that returns the following columns of
recent_grads
, in the specified order:ShareWomen
rounded to4
decimal places.Major_category
- Limit the results to
10
rows
SELECT ROUND(ShareWomen, 4), Major_category FROM recent_grads LIMIT 10;
7: Nested Functions
In a previous screen, we used the following query:
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
FROM recent_grads
GROUP BY Major_category
HAVING share_employed > .8;
This displayed very long fractional values for share_employed
. We can update this with the ROUND
function to round the results to 3
decimal places:
SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed
FROM recent_grads
GROUP BY Major_category
HAVING share_employed > .8;
This will result in the following:
Major_category | share_employed |
---|---|
Agriculture & Natural Resources | 0.837 |
Arts | 0.807 |
Instructions
- Use the
SELECT
statement to select the following columns and aggregates in a query:Major_category
AVG(College_jobs) / AVG(Total)
asshare_degree_jobs
- Use the
ROUND
function to roundshare_degree_jobs
to 3 decimal places.
- Use the
- Group the query by the
Major_category
column. - Only select rows where
share_degree_jobs
is less than.3
.
select Major_category,round(AVG(College_jobs) / AVG(Total),3) as share_degree_jobs
from recent_grads
group by Major_category
HAVING share_degree_jobs < .3
8: Next Steps
In this mission, we covered the GROUP BY
and HAVING
statements. With these statements, we can calculate powerful summary statistics in SQL quickly. In the next few missions, we'll learn more about working with SQL tables, and inserting and modifying data.