1: Counting In Python
In this mission, we'll be working on computing summary statistics using SQL. There have been many cases in the past few missions when we've needed to count the number of records that match a particular SQL query. So far, we've been able to do this by:
- Performing a SQL query with Python.
- Retrieving the results and storing into a list.
- Finding the length of the list.
This approach works, but also requires quite a bit of code and is fairly slow. As we go through this mission, we'll learn how to count records and more using only SQL.
We'll be working with factbook.db
, a SQLite database that contains information about each country in the world. We'll use a table in the file called facts
. Each row in facts
represents a single country, and contains several columns, including:
name
-- the name of the country.area
-- the total land and sea area of the country.population
-- the population of the country.birth_rate
-- the birth rate of the country.created_at
-- the date the record was created.updated_at
-- the date the record was updated.
Here are the first few rows of facts
:
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 2015-11-01 13:19:59.961286 | 2015-11-01 13:19:59.961286 |
Instructions
- Import
sqlite3
. - Initialize a connection to
factbook.db
using theconnect() method, and store it in the variableconn
. - Use
conn
, the execute() method, and the fetchall() method to fetch all the records in thefacts
table. Assign the result to thefacts
variable. - Print out the
facts
variable. - Count the number of items in
facts
, and assign the result tofacts_count
.
import sqlite3
conn=sqlite3.connect('factbook.db')
query=('select * from facts')
facts=conn.execute(query).fetchall()
print(facts)
facts_count=len(facts)
2: Counting In SQL
Counting up the number of records in a table is a common operation, and it feels like it should be more efficient than the code we just wrote in the last screen. Thankfully, SQL includes the COUNT
aggregation function, which allows us to count the number of records in a table. It's called an aggregation function because it works across many rows to compute an aggregate value. Here's an example:
SELECT COUNT(*) FROM facts;
The query above will count the number of rows in the facts
table of factbook.db
. If we instead want to count the number of non-null values in a single column, we can use the following syntax:
SELECT COUNT(area_water)
FROM facts;
Note that the above query will only count the total number of non-null values in the area_water
column, and so can return different counts than COUNT(*)
.
Each of the queries above will return a list with a single tuple when executed in Python. It will look like this:
[(243,)]
In order to get the integer count from the result, you'll need to extract the first element in the first tuple in the results.
This style not only saves typing, it's also much faster for larger datasets, because we can do the counting inside the database, and not have to pull all the data into the Python environment first. In general, operations done within a SQL database engine will be faster than the equivalent operations done after pulling the data into a programming environment. This is because SQL database engines are optimized specifically for querying.
Instructions
- Use the
COUNT
aggregation function to count the number of non-null values in thebirth_rate
column of thefacts
table - Extract the integer value from the result, and assign to
birth_rate_count
. - Print out
birth_rate_count
.
conn = sqlite3.connect("factbook.db")
birth_rate_tuple=conn.execute('select count(birth_rate) from facts;').fetchall()
birth_rate_count=birth_rate_tuple[0][0]
print(birth_rate_tuple)
print(birth_rate_count)
3: Min And Max In SQL
SQL contains other aggregation functions besides COUNT
. MIN
and MAX
are two aggregation functions that allow us to find the maximum and minimum values in columns. Whereas we could use the COUNT
function with any column, we can only use MAX
and MIN
with numeric columns.
SELECT MAX(birth_rate)
FROM facts;
The above query will again return a list with a single tuple:
[(45.45,)]
45.45
is the highest value in the birth_rate
column of the facts
table.
Instructions
- Use the
MIN
function to find the minimum value in thepopulation_growth
column.- Extract the numeric result and assign it to
min_population_growth
. - Print
min_population_growth
.
- Extract the numeric result and assign it to
- Use the
MAX
function to find the maximum value in thedeath_rate
column.- Extract the numeric result and assign it to
max_death_rate
. - Print
max_death_rate
- Extract the numeric result and assign it to
conn = sqlite3.connect("factbook.db")
min_population_growth=conn.execute("SELECT MIN(population_growth) FROM facts;").fetchall()[0][0]
print(min_population_growth)
max_death_rate=conn.execute("SELECT MAX(death_rate) FROM facts;").fetchall()[0][0]
print(max_death_rate)
4: Sum And Average In SQL
The final two aggregation functions that we'll look at are SUM
and AVG
. SUM
finds the total of all the values in a numeric column:
SELECT SUM(birth_rate)
FROM facts;
This will again return a list with a single tuple:
[(4406.909999999998,)]
AVG
finds the mean of all the non-null values in a column:
SELECT AVG(birth_rate)
FROM facts;
The result of the above query is:
[(19.32855263157894,)]
Instructions
- Use the
SUM
function to find the sum of thearea_land
column.- Extract the numeric result and assign it to
total_land_area
. - Print
total_land_area
.
- Extract the numeric result and assign it to
- Use the
AVG
function to find the mean of thearea_water
column.- Extract the numeric result and assign it to
avg_water_area
. - Print
avg_water_area
- Extract the numeric result and assign it to
conn = sqlite3.connect("factbook.db")
total_land_area=conn.execute("select sum(area_land) from facts;").fetchall()[0][0]
print(total_land_area)
avg_water_area=conn.execute("select avg(area_water) from facts;").fetchall()[0][0]
print(avg_water_area)
5: Multiple Aggregation Functions
If we wanted to use the SUM
, AVG
, and MAX
functions on a column, it would be inefficient to write three different queries to retrieve the information. You may recall that we can query multiple columns by separating the names with a comma:
SELECT birth_rate, death_rate, population_growth
FROM facts;
We can apply the sample principle to use multiple aggregation functions in one query:
SELECT COUNT(*), SUM(death_rate), AVG(population_growth)
FROM facts;
Because there are three aggregation functions specified in the query, it will return a list containing a tuple with three elements:
[(261, 1783.2500000000002, 1.2009745762711865)]
The order of the aggregation functions in the query corresponds to the order of the results. So the first element in the tuple is the count of all the rows, the second is the sum of the death_rate
column, and the third is the mean of the population_growth
column.
Instructions
- Write a single query that calculates the following statistics about the
facts
table, in order:- The mean of the
population
column. - The sum of the
population
column. - The maximum value in the
birth_rate
column.
- The mean of the
- Assign the result of the query to
facts_stats
. - Print
facts_stats
conn = sqlite3.connect("factbook.db")
facts_stats=conn.execute("SELECT AVG(population),SUM(population),MAX(birth_rate) FROM facts;").fetchall()
print(facts_stats)
6: Conditional Aggregation
As you may recall from earlier, we can use the WHERE
statement to only query certain rows in a SQL table:
SELECT population
FROM facts
WHERE birth_rate > 10;
The above query will select any values in the population
column where the birth_rate
is higher than 10
. We can also use WHERE
statements with aggregation functions to only calculate statistics for a certain subset of rows:
SELECT COUNT(*)
FROM facts
WHERE population > 5000000;
The query above will count the number of rows where population
is greater than 5000000
.
Instructions
- Calculate the mean
population_growth
for countries with apopulation
greater than10000000
.- Extract the numeric result and assign it to
population_growth
.
- Extract the numeric result and assign it to
-
- Print
population_growth
- Print
conn = sqlite3.connect("factbook.db")
population_growth=conn.execute("select AVG(population_growth) from facts where population>10000000;").fetchall()[0][0]
7: Selecting Unique Rows
There are cases when we'll only want to select the unique values in a column or database, and not get each individual row. One example is if our facts
table had duplicate entries for each country:
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
2 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
If we want to get a list of all the countries in the world, we'll need to remove these duplicate rows, so countries appear twice. We can do this with the DISTINCT
statement:
SELECT DISTINCT name
FROM facts;
The above query will return all of the unique values in the name
column of facts
. It won't return any values twice.
The DISTINCT
statement can also be used with multiple columns, in which case, it will return unique groups of those columns:
SELECT DISTINCT name, population
FROM facts;
The above query will select unique pairs of population
and name
values from facts
.
Instructions
- Select all the distinct values in the
birth_rate
column of thefacts
table, and assign the result to theunique_birth_rates
. - Print
unique_birth_rates
8: Distinct Aggregations
If we wanted to count the number of unique items in the population
column, we could use the COUNT
aggregation function along with the DISTINCT
statement. Here's how it would work:
SELECT COUNT(DISTINCT population)
FROM facts;
The above query will count all the distinct values in the population
column. We can also use other aggregation functions along with theDISTINCT
statement:
SELECT AVG(DISTINCT birth_rate)
FROM facts;
The above query will find the mean of all the distinct values in the birth_rate
column.
Instructions
- Find the average of all the distinct values in the
birth_rate
column whenpopulation
is greater than20000000
.- Extract the numeric result, and assign to
average_birth_rate
. - Print
average_birth_rate
.
- Extract the numeric result, and assign to
- Find the sum of all the distinct values in the
population
column whenarea_land
is greater than1000000
.- Extract the numeric result, and assign to
sum_population
. - Print
sum_population
- Extract the numeric result, and assign to
conn = sqlite3.connect("factbook.db")
average_birth_rate=conn.execute("select avg(distinct birth_rate) from facts where population>20000000;").fetchall()[0][0]
sum_population=conn.execute("select sum(distinct population) from facts where area_land>1000000;").fetchall()[0][0]
print(average_birth_rate)
print(sum_population)
9: Arithmetic In SQL
There are times when we'll want to do some arithmetic on columns in a SQL table. One example is making the counts in the population
column easier to understand by expressing them in terms of millions. Instead of a number like 9766442
, we'd want to display 9.766442
. We could do this in Python, but it would be cumbersome to pull all the data into the Python environment, then manipulate it. Instead, we can perform the math inside the SQL database engine:
SELECT population / 1000000
FROM facts;
The above query will divide every value in the population
column by 1000000
, and return the result. Because the population
column contains integers, and we specified an integer to divide by, the results will be integers as well. If we want to retain precision, we can specify a float instead:
SELECT population / 1000000.0
FROM facts;
The above query will return a series of floats, instead of rounding the values to integers. Here's the rules for what an arithmetic operation will return:
- Two floats -- returns a float (ex.
SELECT birth_rate / 1000000.0 FROM facts;
) - A float and an integer -- returns a float (ex.
SELECT population / 1000000.0 FROM facts;
) - Two integers -- returns an integer (ex.
SELECT population / 1000000 FROM facts;
)
Instructions
- Use arithmetic operators in a SQL query to express
population_growth
in terms of millions. Ensure that you divide by a float so that the result is also a float.- Assign the result of the query to
population_growth_millions
. - Print
population_growth_millions
- Assign the result of the query to
conn = sqlite3.connect("factbook.db")
population_growth_millions=conn.execute("select population_growth/1000000.0 from facts;").fetchall()
print(population_growth_millions)
10: Arithmetic Between Columns
A few screens ago, we learned how to apply aggregation functions to columns after the SELECT
statement:
SELECT AVG(birth_rate), SUM(population)
FROM facts;
This modified the values of the columns before they were returned. SQL lets us perform many different kinds of manipulations on the columns we select. If we wanted to calculate the ratio between births and deaths for each country, we could divide the birth_rate
column by the death_rate
column. Here's how we could do it:
SELECT birth_rate / death_rate
FROM facts;
The above query will divide each value in the birth_rate
column by its corresponding value in the death_rate
column.
We can also perform more complex queries, such as finding the ratio of birth_rate
plus migration_rate
to death_rate
, which will help us discover if the population is increasing or decreasing:
SELECT (birth_rate + migration_rate) / death_rate
FROM facts;
The above query will add together the birth_rate
and migration_rate
columns, then divide by the death_rate
column. Arithmetic in SQL respects the order of operations and parentheses, so the addition step happens before the division step.
Instructions
- Use a SQL query to compute the population of each country a year from now.
- Multiply the
population
andpopulation_growth
columns, then add thepopulation
column to the result.
- Multiply the
- Assign the result of the query to
next_year_population
. - Print
next_year_population
conn = sqlite3.connect("factbook.db")
next_year_population=conn.execute("select population*population_growth+population from facts;").fetchall()
print(next_year_population)
11: Next Steps
In this mission, we covered computing summary statistics in SQL. It's often advantageous to do these computations in the SQL database versus in a Python environment because it's faster to code up and execute. In the next mission, we'll cover computing more advanced statistics in SQL with the GROUP BY
statement.