SQL Summary Statistics

最新推荐文章于 2023-04-03 17:32:03 发布

weixin_33737134

最新推荐文章于 2023-04-03 17:32:03 发布

阅读量172

点赞数

文章标签： python 数据库

原文链接：https://my.oschina.net/Bettyty/blog/747367

版权

2019独角兽企业重金招聘Python工程师标准>>>

1: Counting In Python

In this mission, we'll be working on computing summary statistics using SQL. There have been many cases in the past few missions when we've needed to count the number of records that match a particular SQL query. So far, we've been able to do this by:

Performing a SQL query with Python.
Retrieving the results and storing into a list.
Finding the length of the list.

This approach works, but also requires quite a bit of code and is fairly slow. As we go through this mission, we'll learn how to count records and more using only SQL.

We'll be working with factbook.db, a SQLite database that contains information about each country in the world. We'll use a table in the file called facts. Each row in facts represents a single country, and contains several columns, including:

name -- the name of the country.
area -- the total land and sea area of the country.
population -- the population of the country.
birth_rate -- the birth rate of the country.
created_at -- the date the record was created.
updated_at -- the date the record was updated.

Here are the first few rows of facts:

id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate	created_at	updated_at
1	af	Afghanistan	652230	652230	0	32564342	2.32	38.57	13.89	1.51	2015-11-01 13:19:49.461734	2015-11-01 13:19:49.461734
2	al	Albania	28748	27398	1350	3029278	0.3	12.92	6.58	3.3	2015-11-01 13:19:54.431082	2015-11-01 13:19:54.431082
3	ag	Algeria	2381741	2381741	0	39542166	1.84	23.67	4.31	0.92	2015-11-01 13:19:59.961286	2015-11-01 13:19:59.961286

Instructions

Import sqlite3.
Initialize a connection tofactbook.db using theconnect() method, and store it in the variable conn.
Use conn, the execute() method, and the fetchall() method to fetch all the records in the facts table. Assign the result to the factsvariable.
Print out the facts variable.
Count the number of items infacts, and assign the result tofacts_count.

import sqlite3
conn=sqlite3.connect('factbook.db')
query=('select * from facts')
facts=conn.execute(query).fetchall()
print(facts)
facts_count=len(facts)

2: Counting In SQL

Counting up the number of records in a table is a common operation, and it feels like it should be more efficient than the code we just wrote in the last screen. Thankfully, SQL includes the COUNT aggregation function, which allows us to count the number of records in a table. It's called an aggregation function because it works across many rows to compute an aggregate value. Here's an example:

SELECT COUNT(*) FROM facts;

The query above will count the number of rows in the facts table of factbook.db. If we instead want to count the number of non-null values in a single column, we can use the following syntax:

SELECT COUNT(area_water)

FROM facts;

Note that the above query will only count the total number of non-null values in the area_water column, and so can return different counts than COUNT(*).

Each of the queries above will return a list with a single tuple when executed in Python. It will look like this:

[(243,)]

In order to get the integer count from the result, you'll need to extract the first element in the first tuple in the results.

This style not only saves typing, it's also much faster for larger datasets, because we can do the counting inside the database, and not have to pull all the data into the Python environment first. In general, operations done within a SQL database engine will be faster than the equivalent operations done after pulling the data into a programming environment. This is because SQL database engines are optimized specifically for querying.

Instructions

Use the COUNT aggregation function to count the number of non-null values in thebirth_rate column of thefacts table
Extract the integer value from the result, and assign tobirth_rate_count.
Print out birth_rate_count.

conn = sqlite3.connect("factbook.db")
birth_rate_tuple=conn.execute('select count(birth_rate) from facts;').fetchall()
birth_rate_count=birth_rate_tuple[0][0]
print(birth_rate_tuple)
print(birth_rate_count)

3: Min And Max In SQL

SQL contains other aggregation functions besides COUNT. MIN and MAX are two aggregation functions that allow us to find the maximum and minimum values in columns. Whereas we could use the COUNT function with any column, we can only use MAX and MINwith numeric columns.

SELECT MAX(birth_rate)

FROM facts;

The above query will again return a list with a single tuple:

[(45.45,)]

45.45 is the highest value in the birth_rate column of the facts table.

Instructions

Use the MIN function to find the minimum value in thepopulation_growth column.
- Extract the numeric result and assign it tomin_population_growth.
- Printmin_population_growth.
Use the MAX function to find the maximum value in thedeath_rate column.
- Extract the numeric result and assign it tomax_death_rate.
- Print max_death_rate

conn = sqlite3.connect("factbook.db")
min_population_growth=conn.execute("SELECT MIN(population_growth) FROM facts;").fetchall()[0][0]

print(min_population_growth)
max_death_rate=conn.execute("SELECT MAX(death_rate) FROM facts;").fetchall()[0][0]
print(max_death_rate)

4: Sum And Average In SQL

The final two aggregation functions that we'll look at are SUM and AVG. SUM finds the total of all the values in a numeric column:

SELECT SUM(birth_rate)

FROM facts;

This will again return a list with a single tuple:

[(4406.909999999998,)]

AVG finds the mean of all the non-null values in a column:

SELECT AVG(birth_rate)

FROM facts;

The result of the above query is:

[(19.32855263157894,)]

Instructions

Use the SUM function to find the sum of the area_land column.
- Extract the numeric result and assign it tototal_land_area.
- Print total_land_area.
Use the AVG function to find the mean of the area_watercolumn.
- Extract the numeric result and assign it toavg_water_area.
- Print avg_water_area

conn = sqlite3.connect("factbook.db")
total_land_area=conn.execute("select sum(area_land) from facts;").fetchall()[0][0]
print(total_land_area)

avg_water_area=conn.execute("select avg(area_water) from facts;").fetchall()[0][0]

print(avg_water_area)

5: Multiple Aggregation Functions

If we wanted to use the SUM, AVG, and MAX functions on a column, it would be inefficient to write three different queries to retrieve the information. You may recall that we can query multiple columns by separating the names with a comma:

SELECT birth_rate, death_rate, population_growth

FROM facts;

We can apply the sample principle to use multiple aggregation functions in one query:

SELECT COUNT(*), SUM(death_rate), AVG(population_growth)

FROM facts;

Because there are three aggregation functions specified in the query, it will return a list containing a tuple with three elements:

[(261, 1783.2500000000002, 1.2009745762711865)]

The order of the aggregation functions in the query corresponds to the order of the results. So the first element in the tuple is the count of all the rows, the second is the sum of the death_rate column, and the third is the mean of the population_growth column.

Instructions

Write a single query that calculates the following statistics about the facts table, in order:
- The mean of thepopulation column.
- The sum of thepopulation column.
- The maximum value in thebirth_rate column.
Assign the result of the query tofacts_stats.
Print facts_stats

conn = sqlite3.connect("factbook.db")
facts_stats=conn.execute("SELECT AVG(population),SUM(population),MAX(birth_rate) FROM facts;").fetchall()
print(facts_stats)

6: Conditional Aggregation

As you may recall from earlier, we can use the WHERE statement to only query certain rows in a SQL table:

SELECT population

FROM facts

WHERE birth_rate > 10;

The above query will select any values in the population column where the birth_rate is higher than 10. We can also use WHEREstatements with aggregation functions to only calculate statistics for a certain subset of rows:

SELECT COUNT(*)

FROM facts

WHERE population > 5000000;

The query above will count the number of rows where population is greater than 5000000.

Instructions

Calculate the meanpopulation_growth for countries with a populationgreater than 10000000.
- Extract the numeric result and assign it topopulation_growth.
- Printpopulation_growth

conn = sqlite3.connect("factbook.db")
population_growth=conn.execute("select AVG(population_growth) from facts where population>10000000;").fetchall()[0][0]

7: Selecting Unique Rows

There are cases when we'll only want to select the unique values in a column or database, and not get each individual row. One example is if our facts table had duplicate entries for each country:

id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate	created_at	updated_at
1	af	Afghanistan	652230	652230	0	32564342	2.32	38.57	13.89	1.51	2015-11-01 13:19:49.461734	2015-11-01 13:19:49.461734
2	af	Afghanistan	652230	652230	0	32564342	2.32	38.57	13.89	1.51	2015-11-01 13:19:49.461734	2015-11-01 13:19:49.461734

If we want to get a list of all the countries in the world, we'll need to remove these duplicate rows, so countries appear twice. We can do this with the DISTINCT statement:

SELECT DISTINCT name

FROM facts;

The above query will return all of the unique values in the name column of facts. It won't return any values twice.

The DISTINCT statement can also be used with multiple columns, in which case, it will return unique groups of those columns:

SELECT DISTINCT name, population

FROM facts;

The above query will select unique pairs of population and name values from facts.

Instructions

Select all the distinct values in thebirth_rate column of thefacts table, and assign the result to the unique_birth_rates.
Print unique_birth_rates

8: Distinct Aggregations

If we wanted to count the number of unique items in the population column, we could use the COUNT aggregation function along with the DISTINCT statement. Here's how it would work:

SELECT COUNT(DISTINCT population)

FROM facts;

The above query will count all the distinct values in the population column. We can also use other aggregation functions along with theDISTINCT statement:

SELECT AVG(DISTINCT birth_rate)

FROM facts;

The above query will find the mean of all the distinct values in the birth_rate column.

Instructions

Find the average of all the distinct values in the birth_ratecolumn when population is greater than 20000000.
- Extract the numeric result, and assign toaverage_birth_rate.
- Printaverage_birth_rate.
Find the sum of all the distinct values in the populationcolumn when area_land is greater than 1000000.
- Extract the numeric result, and assign tosum_population.
- Print sum_population

conn = sqlite3.connect("factbook.db")
average_birth_rate=conn.execute("select avg(distinct birth_rate) from facts where population>20000000;").fetchall()[0][0]
sum_population=conn.execute("select sum(distinct population) from facts where area_land>1000000;").fetchall()[0][0]
print(average_birth_rate)
print(sum_population)

9: Arithmetic In SQL

There are times when we'll want to do some arithmetic on columns in a SQL table. One example is making the counts in the populationcolumn easier to understand by expressing them in terms of millions. Instead of a number like 9766442, we'd want to display 9.766442. We could do this in Python, but it would be cumbersome to pull all the data into the Python environment, then manipulate it. Instead, we can perform the math inside the SQL database engine:

SELECT population / 1000000

FROM facts;

The above query will divide every value in the population column by 1000000, and return the result. Because the population column contains integers, and we specified an integer to divide by, the results will be integers as well. If we want to retain precision, we can specify a float instead:

SELECT population / 1000000.0

FROM facts;

The above query will return a series of floats, instead of rounding the values to integers. Here's the rules for what an arithmetic operation will return:

Two floats -- returns a float (ex. SELECT birth_rate / 1000000.0 FROM facts;)
A float and an integer -- returns a float (ex. SELECT population / 1000000.0 FROM facts;)
Two integers -- returns an integer (ex. SELECT population / 1000000 FROM facts;)

Instructions

Use arithmetic operators in a SQL query to expresspopulation_growth in terms of millions. Ensure that you divide by a float so that the result is also a float.
- Assign the result of the query topopulation_growth_millions.
- Printpopulation_growth_millions

conn = sqlite3.connect("factbook.db")
population_growth_millions=conn.execute("select population_growth/1000000.0 from facts;").fetchall()
print(population_growth_millions)

10: Arithmetic Between Columns

A few screens ago, we learned how to apply aggregation functions to columns after the SELECT statement:

SELECT AVG(birth_rate), SUM(population)

FROM facts;

This modified the values of the columns before they were returned. SQL lets us perform many different kinds of manipulations on the columns we select. If we wanted to calculate the ratio between births and deaths for each country, we could divide the birth_ratecolumn by the death_rate column. Here's how we could do it:

SELECT birth_rate / death_rate

FROM facts;

The above query will divide each value in the birth_rate column by its corresponding value in the death_rate column.

We can also perform more complex queries, such as finding the ratio of birth_rate plus migration_rate to death_rate, which will help us discover if the population is increasing or decreasing:

SELECT (birth_rate + migration_rate) / death_rate

FROM facts;

The above query will add together the birth_rate and migration_rate columns, then divide by the death_rate column. Arithmetic in SQL respects the order of operations and parentheses, so the addition step happens before the division step.

Instructions

Use a SQL query to compute the population of each country a year from now.
- Multiply the populationand population_growthcolumns, then add thepopulation column to the result.
Assign the result of the query tonext_year_population.
Print next_year_population

conn = sqlite3.connect("factbook.db")
next_year_population=conn.execute("select population*population_growth+population from facts;").fetchall()
print(next_year_population)

11: Next Steps

In this mission, we covered computing summary statistics in SQL. It's often advantageous to do these computations in the SQL database versus in a Python environment because it's faster to code up and execute. In the next mission, we'll cover computing more advanced statistics in SQL with the GROUP BY statement.

转载于:https://my.oschina.net/Bettyty/blog/747367