Group Summary Statistics-CSDN博客

2019独角兽企业重金招聘Python工程师标准>>>

1: Statistics By Group

In the last mission, we computed summary statistics across columns using SQL. In many cases, though, we want to drill down even more, and compute summary statistics per-group. In this mission, we'll explore how to calculate more granular summary statistics. We'll switch back to writing SQL queries directly instead of using Python so we can focus more on the SQL syntax.

We'll be working with a dataset on jobs that's stored in the recent_grads table of jobs.db. Each row represents a single college major, and contains information about post-graduation employment of students who had the major. You can find out more about the datasethere. Here are descriptions of a few of the columns (the whole dataset has 21 columns):

Rank - The numerical rank of the major by post-graduation median earnings.
Major_code - The numerical code of the major.
Major - The description of the major.
Major_category - The category of the major.
Total - The total number of people who studied the major.
Men - The number of men who studied the major.
Women - The number of women who studied the major.
ShareWomen - The share of women (from 0 to 1) who studied the major.
Employed - The number of people who studied the major and were employed post-graduation.

Here are the first few rows and columns in the dataset:

Rank	Major_code	Major	Major_category	Total	Sample_size	Men	Women	ShareWomen	Employed
1	2419	PETROLEUM ENGINEERING	Engineering	2339	36	2057	282	0.120564	1976
2	2416	MINING AND MINERAL ENGINEERING	Engineering	756	7	679	77	0.101852	640
3	2415	METALLURGICAL ENGINEERING	Engineering	856	3	725	131	0.153037	648
4	2417	NAVAL ARCHITECTURE AND MARINE ENGINEERING	Engineering	1258	16	1123	135	0.107313	758
5	2405	CHEMICAL ENGINEERING	Engineering	32260	289	21239	11021	0.341631	25694

As we go through this mission, we'll drill down and compute summary statistics by group, and answer questions like:

What's the share of women in each major category?
What major categories are the most likely to be employed post-graduation?
What percentage of people in each major category get a low wage job?

First, let's explore the data.

Instructions

Write a SQL query that displays all of the columns and the first 5 rows of therecent_grads table.

select * from recent_grads limit 5 ;

2: The GROUP BY Statement

The GROUP BY SQL statement allows us to compute summary statistics for each group in a dataset. Groups are defined by the unique values in a column or set of columns. The unique values are equivalent to using the DISTINCT statement. To illustrate, we can figure out the total number of people employed in each major category with the following query:

SELECT SUM(Employed)

FROM recent_grads

GROUP BY Major_category;

This will give us the count of total people employed for each major category (output is truncated for length):

SUM(Employed)
66943
288114
302797

The above displays aggregate counts of the Employed column for each Major_category. Unfortunately, the above doesn't have any indication of which major category each row refers to. We can fix this by selecting the Major_category column as well:

SELECT Major_category, SUM(Employed)

FROM recent_grads

GROUP BY Major_category;

This results in more clear output:

Major_category	SUM(Employed)
Agriculture & Natural Resources	66943
Arts	288114
Biology & Life Science	302797

This works because the GROUP BY statement splits each category into groups, then figures out what values to display. Here's a diagram of how GROUP BY splits the data into groups, using a small sample from the recent_grads table:

Major_categoryEmployedArts2914SELECTEmployed,Major_category,SUM(Employed)Agriculture3149FROMrecent_gradsGROUPBYMajor_category;Arts36165Agriculture1290Group1Group2Major_category="Arts"Major_category="Agriculture"Arts2914Agriculture3149Arts36165Agriculture1290

For each group, the GROUP BY statement queries each column and aggregation function mentioned after the SELECT statement:

SELECTEmployed,Major_category,SUM(Employed)FROMrecent_gradsGROUPBYMajor_category;Group1Major_category="Arts"Arts2914Arts36165EmployedMajor_categorySUM(Employed)36165Arts2914+36165=39075

If a column is selected, the SQL engine will use the last value for that column in the group. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.

For the query in the diagram, we end up with the following result:

Employed	Major_category	SUM(Employed)
1290	Agriculture	4439
36165	Arts	39075

Instructions

Use the SELECT statement to select the following columns and aggregates in a query:
- Major_category
- AVG(ShareWomen)
Use the GROUP BY statement to group the query by theMajor_category column.

select Major_category,AVG(ShareWomen) from recent_grads group by Major_category

3: The AS Statement

You may have noticed that in the last screen, specifying AVG(ShareWomen) caused the column to show up with that name in the results. This can often be unintuitive, and make it hard to work with the results of SQL queries. In order to help with this, when we select columns, we can rename them using the AS statement. Here's an example:

SELECT AVG(ShareWomen) AS average_female_share

FROM recent_grads;

This query will result in the following output:

average_female_share
0.5225502029537575

Instructions

Write a query that selects the following items, in order, and renames them with AS:
- SUM(Men) as total_men.
- SUM(Women) astotal_women

select SUM(Men) as total_men,SUM(Women) as total_women from recent_grads

4: Practice: Using GROUP BY

Now that we understand the GROUP BY statement better, let's practice with it to compute some summary statistics by group in therecent_grads table.

Instructions

Find the percentage of graduates that are employed in each major category.
- Use the SELECT statement to select the following columns and aggregates in a query:
  - Major_category
  - AVG(Employed) / AVG(Total) asshare_employed
- Use the GROUP BYstatement to group the query by theMajor_category column.

select Major_category,AVG(Employed) / AVG(Total) as share_employed
from recent_grads
group by
Major_category

5: The HAVING Statement

Sometimes, we'll want to select a subset of rows after we perform a GROUP BY query. For instance, in the last screen, we may have only wanted to select rows where share_employed is greater than .8. We can't use the WHERE clause to do this, because share_employedisn't a column in recent_grads. Instead, it's a virtual column that's generated by the GROUP BY statement.

In cases like this, where we want to filter on a generated column, we can use the HAVING statement. Here's an example:

SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

Note how we're able to use the same column name in the HAVING statement that we specified using the AS statement. SQL allows us to use custom column names in subsequent statements, including HAVING and WHERE. The above statement will result in the following output:

Major_category	share_employed
Agriculture & Natural Resources	0.8369862842425075
Arts	0.8067482429367457
Business	0.8359659576036412
Communications & Journalism	0.8422291333949735

Note how only categories where share_employed is greater than .8 are shown above. This is because the HAVING statement filters out the other rows.

Instructions

Find all the major categories where the share of graduates with low wage jobs is greater the .1.
- Use the SELECT statement to select the following columns and aggregates in a query:
  - Major_category
  - AVG(Low_wage_jobs) / AVG(Total) asshare_low_wage
- Use the GROUP BYstatement to group the query by theMajor_category column.
- Use the HAVING statement to only select rows whereshare_low_wage is greater than .1

select Major_category,AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
from recent_grads
group by Major_category
HAVING share_low_wage > .1

6: The ROUND Function

In the last screen, the percentages we got as results were very long and hard to read -- they looked like 0.16833085991095678. We can use the SQL ROUND function to round values when we query them.

SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women

FROM recent_grads;

The above SQL query will round the ShareWomen column to 2 decimal places, and display the results. Here's a truncated view of the results:

Major_category	rounded_share_women
Engineering	0.12
Engineering	0.1

By specifying different values to the ROUND function, such as ROUND(ShareWomen, 3), we can round to different numbers of decimal places.

Instructions

Write a SQL query that returns the following columns ofrecent_grads, in the specified order:
- ShareWomen rounded to 4decimal places.
- Major_category
Limit the results to 10 rows

SELECT ROUND(ShareWomen, 4), Major_category FROM recent_grads LIMIT 10;

7: Nested Functions

In a previous screen, we used the following query:

SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

This displayed very long fractional values for share_employed. We can update this with the ROUND function to round the results to 3decimal places:

SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed

FROM recent_grads

GROUP BY Major_category

HAVING share_employed > .8;

This will result in the following:

Major_category	share_employed
Agriculture & Natural Resources	0.837
Arts	0.807

Instructions

Use the SELECT statement to select the following columns and aggregates in a query:
- Major_category
- AVG(College_jobs) / AVG(Total) asshare_degree_jobs
  - Use the ROUNDfunction to roundshare_degree_jobsto 3 decimal places.
Group the query by theMajor_category column.
Only select rows whereshare_degree_jobs is less than.3.

select Major_category,round(AVG(College_jobs) / AVG(Total),3) as share_degree_jobs
from recent_grads
group by Major_category
HAVING share_degree_jobs < .3

8: Next Steps

In this mission, we covered the GROUP BY and HAVING statements. With these statements, we can calculate powerful summary statistics in SQL quickly. In the next few missions, we'll learn more about working with SQL tables, and inserting and modifying data.

转载于:https://my.oschina.net/Bettyty/blog/747547