SQL Summary Statistics

1: Counting In Python

In this mission, we'll be working on computing summary statistics using SQL. There have been many cases in the past few missions when we've needed to count the number of records that match a particular SQL query. So far, we've been able to do this by:

  • Performing a SQL query with Python.
  • Retrieving the results and storing into a list.
  • Finding the length of the list.

This approach works, but also requires quite a bit of code and is fairly slow. As we go through this mission, we'll learn how to count records and more using only SQL.

We'll be working with factbook.db, a SQLite database that contains information about each country in the world. We'll use a table in the file called facts. Each row in facts represents a single country, and contains several columns, including:

  • name -- the name of the country.
  • area -- the total land and sea area of the country.
  • population -- the population of the country.
  • birth_rate -- the birth rate of the country.
  • created_at -- the date the record was created.
  • updated_at -- the date the record was updated.

Here are the first few rows of facts:

idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_ratecreated_atupdated_at
1afAfghanistan6522306522300325643422.3238.5713.891.512015-11-01 13:19:49.4617342015-11-01 13:19:49.461734
2alAlbania2874827398135030292780.312.926.583.32015-11-01 13:19:54.4310822015-11-01 13:19:54.431082
3agAlgeria238174123817410395421661.8423.674.310.922015-11-01 13:19:59.9612862015-11-01 13:19:59.961286

Instructions

  • Import sqlite3.
  • Initialize a connection tofactbook.db using theconnect() method, and store it in the variable conn.
  • Use conn, the execute() method, and the fetchall() method to fetch all the records in the facts table. Assign the result to the factsvariable.
  • Print out the facts variable.
  • Count the number of items infacts, and assign the result tofacts_count.

import sqlite3
conn=sqlite3.connect('factbook.db')
query=('select * from facts')
facts=conn.execute(query).fetchall()
print(facts)
facts_count=len(facts)

2: Counting In SQL

Counting up the number of records in a table is a common operation, and it feels like it should be more efficient than the code we just wrote in the last screen. Thankfully, SQL includes the COUNT aggregation function, which allows us to count the number of records in a table. It's called an aggregation function because it works across many rows to compute an aggregate value. Here's an example:

 

SELECT COUNT(*) FROM facts;

The query above will count the number of rows in the facts table of factbook.db. If we instead want to count the number of non-null values in a single column, we can use the following syntax:

 

SELECT COUNT(area_water)

FROM facts;

Note that the above query will only count the total number of non-null values in the area_water column, and so can return different counts than COUNT(*).

Each of the queries above will return a list with a single tuple when executed in Python. It will look like this:

 

[(243,)]

In order to get the integer count from the result, you'll need to extract the first element in the first tuple in the results.

This style not only saves typing, it's also much faster for larger datasets, because we can do the counting inside the database, and not have to pull all the data into the Python environment first. In general, operations done within a SQL database engine will be faster than the equivalent operations done after pulling the data into a programming environment. This is because SQL database engines are optimized specifically for querying.

Instructions

  • Use the COUNT aggregation function to count the number of non-null values in thebirth_rate column of thefacts table
  • Extract the integer value from the result, and assign tobirth_rate_count.
  • Print out birth_rate_count.

 

conn = sqlite3.connect("factbook.db")
birth_rate_tuple=conn.execute('select count(birth_rate) from facts;').fetchall()
birth_rate_count=birth_rate_tuple[0][0]
print(birth_rate_tuple)
print(birth_rate_count)

 

3: Min And Max In SQL

SQL contains other aggregation functions besides COUNTMIN and MAX are two aggregation functions that allow us to find the maximum and minimum values in columns. Whereas we could use the COUNT function with any column, we can only use MAX and MINwith numeric columns.

 

SELECT MAX(birth_rate)

FROM facts;

The above query will again return a list with a single tuple:

 

[(45.45,)]

45.45 is the highest value in the birth_rate column of the facts table.

Instructions

  • Use the MIN function to find the minimum value in thepopulation_growth column.
    • Extract the numeric result and assign it tomin_population_growth.
    • Printmin_population_growth.
  • Use the MAX function to find the maximum value in thedeath_rate column.
    • Extract the numeric result and assign it tomax_death_rate.
    • Print max_death_rate

conn = sqlite3.connect("factbook.db")
min_population_growth=conn.execute("SELECT MIN(population_growth) FROM facts;").fetchall()[0][0]

print(min_population_growth)
max_death_rate=conn.execute("SELECT MAX(death_rate) FROM facts;").fetchall()[0][0]
print(max_death_rate)

 

4: Sum And Average In SQL

The final two aggregation functions that we'll look at are SUM and AVGSUM finds the total of all the values in a numeric column:

 

SELECT SUM(birth_rate)

FROM facts;

This will again return a list with a single tuple:

 

[(4406.909999999998,)]

AVG finds the mean of all the non-null values in a column:

 

SELECT AVG(birth_rate)

FROM facts;

The result of the above query is:

 

[(19.32855263157894,)]

Instructions

  • Use the SUM function to find the sum of the area_land column.
    • Extract the numeric result and assign it tototal_land_area.
    • Print total_land_area.
  • Use the AVG function to find the mean of the area_watercolumn.
    • Extract the numeric result and assign it toavg_water_area.
    • Print avg_water_area

conn = sqlite3.connect("factbook.db")
total_land_area=conn.execute("select sum(area_land) from facts;").fetchall()[0][0]
print(total_land_area)

avg_water_area=conn.execute("select avg(area_water) from facts;").fetchall()[0][0]

print(avg_water_area)

 

5: Multiple Aggregation Functions

If we wanted to use the SUMAVG, and MAX functions on a column, it would be inefficient to write three different queries to retrieve the information. You may recall that we can query multiple columns by separating the names with a comma:

 

SELECT birth_rate, death_rate, population_growth

FROM facts;

We can apply the sample principle to use multiple aggregation functions in one query:

 

SELECT COUNT(*), SUM(death_rate), AVG(population_growth)

FROM facts;

Because there are three aggregation functions specified in the query, it will return a list containing a tuple with three elements:

 

[(261, 1783.2500000000002, 1.2009745762711865)]

The order of the aggregation functions in the query corresponds to the order of the results. So the first element in the tuple is the count of all the rows, the second is the sum of the death_rate column, and the third is the mean of the population_growth column.

Instructions

  • Write a single query that calculates the following statistics about the facts table, in order:
    • The mean of thepopulation column.
    • The sum of thepopulation column.
    • The maximum value in thebirth_rate column.
  • Assign the result of the query tofacts_stats.
  • Print facts_stats

 

 

conn = sqlite3.connect("factbook.db")
facts_stats=conn.execute("SELECT AVG(population),SUM(population),MAX(birth_rate) FROM facts;").fetchall()
print(facts_stats)

6: Conditional Aggregation

As you may recall from earlier, we can use the WHERE statement to only query certain rows in a SQL table:

 

SELECT population

FROM facts

WHERE birth_rate > 10;

The above query will select any values in the population column where the birth_rate is higher than 10. We can also use WHEREstatements with aggregation functions to only calculate statistics for a certain subset of rows:

 

SELECT COUNT(*)

FROM facts

WHERE population > 5000000;

The query above will count the number of rows where population is greater than 5000000.

Instructions

  • Calculate the meanpopulation_growth for countries with a populationgreater than 10000000.
    • Extract the numeric result and assign it topopulation_growth.
    • Printpopulation_growth

 

conn = sqlite3.connect("factbook.db")
population_growth=conn.execute("select AVG(population_growth) from facts where population>10000000;").fetchall()[0][0]

7: Selecting Unique Rows

There are cases when we'll only want to select the unique values in a column or database, and not get each individual row. One example is if our facts table had duplicate entries for each country:

idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_ratecreated_atupdated_at
1afAfghanistan6522306522300325643422.3238.5713.891.512015-11-01 13:19:49.4617342015-11-01 13:19:49.461734
2afAfghanistan6522306522300325643422.3238.5713.891.512015-11-01 13:19:49.4617342015-11-01 13:19:49.461734

If we want to get a list of all the countries in the world, we'll need to remove these duplicate rows, so countries appear twice. We can do this with the DISTINCT statement:

 

SELECT DISTINCT name

FROM facts;

The above query will return all of the unique values in the name column of facts. It won't return any values twice.

The DISTINCT statement can also be used with multiple columns, in which case, it will return unique groups of those columns:

 

SELECT DISTINCT name, population

FROM facts;

The above query will select unique pairs of population and name values from facts.

Instructions

  • Select all the distinct values in thebirth_rate column of thefacts table, and assign the result to the unique_birth_rates.
  • Print unique_birth_rates

8: Distinct Aggregations

If we wanted to count the number of unique items in the population column, we could use the COUNT aggregation function along with the DISTINCT statement. Here's how it would work:

 

SELECT COUNT(DISTINCT population)

FROM facts;

The above query will count all the distinct values in the population column. We can also use other aggregation functions along with theDISTINCT statement:

 

SELECT AVG(DISTINCT birth_rate)

FROM facts;

The above query will find the mean of all the distinct values in the birth_rate column.

Instructions

  • Find the average of all the distinct values in the birth_ratecolumn when population is greater than 20000000.
    • Extract the numeric result, and assign toaverage_birth_rate.
    • Printaverage_birth_rate.
  • Find the sum of all the distinct values in the populationcolumn when area_land is greater than 1000000.
    • Extract the numeric result, and assign tosum_population.
    • Print sum_population

conn = sqlite3.connect("factbook.db")
average_birth_rate=conn.execute("select avg(distinct birth_rate) from facts where population>20000000;").fetchall()[0][0]
sum_population=conn.execute("select sum(distinct population) from facts where area_land>1000000;").fetchall()[0][0]
print(average_birth_rate)
print(sum_population)

 

9: Arithmetic In SQL

There are times when we'll want to do some arithmetic on columns in a SQL table. One example is making the counts in the populationcolumn easier to understand by expressing them in terms of millions. Instead of a number like 9766442, we'd want to display 9.766442. We could do this in Python, but it would be cumbersome to pull all the data into the Python environment, then manipulate it. Instead, we can perform the math inside the SQL database engine:

 

SELECT population / 1000000

FROM facts;

The above query will divide every value in the population column by 1000000, and return the result. Because the population column contains integers, and we specified an integer to divide by, the results will be integers as well. If we want to retain precision, we can specify a float instead:

 

SELECT population / 1000000.0

FROM facts;

The above query will return a series of floats, instead of rounding the values to integers. Here's the rules for what an arithmetic operation will return:

  • Two floats -- returns a float (ex. SELECT birth_rate / 1000000.0 FROM facts;)
  • A float and an integer -- returns a float (ex. SELECT population / 1000000.0 FROM facts;)
  • Two integers -- returns an integer (ex. SELECT population / 1000000 FROM facts;)

Instructions

  • Use arithmetic operators in a SQL query to expresspopulation_growth in terms of millions. Ensure that you divide by a float so that the result is also a float.
    • Assign the result of the query topopulation_growth_millions.
    • Printpopulation_growth_millions

conn = sqlite3.connect("factbook.db")
population_growth_millions=conn.execute("select population_growth/1000000.0 from facts;").fetchall()
print(population_growth_millions)

10: Arithmetic Between Columns

A few screens ago, we learned how to apply aggregation functions to columns after the SELECT statement:

 

SELECT AVG(birth_rate), SUM(population)

FROM facts;

This modified the values of the columns before they were returned. SQL lets us perform many different kinds of manipulations on the columns we select. If we wanted to calculate the ratio between births and deaths for each country, we could divide the birth_ratecolumn by the death_rate column. Here's how we could do it:

 

SELECT birth_rate / death_rate

FROM facts;

The above query will divide each value in the birth_rate column by its corresponding value in the death_rate column.

We can also perform more complex queries, such as finding the ratio of birth_rate plus migration_rate to death_rate, which will help us discover if the population is increasing or decreasing:

 

SELECT (birth_rate + migration_rate) / death_rate

FROM facts;

The above query will add together the birth_rate and migration_rate columns, then divide by the death_rate column. Arithmetic in SQL respects the order of operations and parentheses, so the addition step happens before the division step.

Instructions

  • Use a SQL query to compute the population of each country a year from now.
    • Multiply the populationand population_growthcolumns, then add thepopulation column to the result.
  • Assign the result of the query tonext_year_population.
  • Print next_year_population

 

conn = sqlite3.connect("factbook.db")
next_year_population=conn.execute("select population*population_growth+population from facts;").fetchall()
print(next_year_population)

 

11: Next Steps

In this mission, we covered computing summary statistics in SQL. It's often advantageous to do these computations in the SQL database versus in a Python environment because it's faster to code up and execute. In the next mission, we'll cover computing more advanced statistics in SQL with the GROUP BY statement.

 

转载于:https://my.oschina.net/Bettyty/blog/747367

import sys sys.tracebacklimit = 0 import os os.environ['PYTHONUNBUFFERED'] = '1' import arcpy # 获取参数 input_features = arcpy.GetParameterAsText(0) join_field = arcpy.GetParameterAsText(1) target_feature = arcpy.GetParameterAsText(2) target_field = arcpy.GetParameterAsText(3) area_threshold = arcpy.GetParameterAsText(4) # 创建空间连接 join_result = arcpy.SpatialJoin_analysis(input_features, target_feature, "in_memory/spatial_join", "JOIN_ONE_TO_ONE", "KEEP_ALL", "", "INTERSECT") # 使用MakeFeatureLayer创建要素图层,并使用AddFieldDelimiters处理字段名称 join_layer = arcpy.management.MakeFeatureLayer(join_result, "join_layer").getOutput(0) join_field_name = arcpy.AddFieldDelimiters(join_layer, join_field) # 使用SelectLayerByAttribute选择重叠面积大于阈值的要素 arcpy.management.SelectLayerByAttribute(join_layer, "NEW_SELECTION", "Shape_Area > " + str(area_threshold)) # 使用SummaryStatistics工具进行面积求和 summary_table = arcpy.Statistics_analysis(join_layer, "in_memory/summary_table", [["Shape_Area", "SUM"]], [join_field_name]) # 使用TableToNumPyArray将结果转换为字典 sum_dict = {} with arcpy.da.TableToNumPyArray(summary_table, [join_field, "SUM_Shape_Area"]) as arr: for row in arr: sum_dict[row[0]] = row[1] # 使用UpdateCursor更新目标要素类的目标字段 with arcpy.da.UpdateCursor(target_feature, [target_field, join_field], sql_clause=(None, "ORDER BY OBJECTID")) as cursor: for row in cursor: join_value = row[1] if join_value in sum_dict: area_sum = sum_dict[join_value] row[0] = area_sum cursor.updateRow(row) # 导出结果 output_feature = arcpy.GetParameterAsText(5) arcpy.CopyFeatures_management(target_feature, output_feature) # 删除游标对象和要素图层对象 del cursor, join_layer运行错误:Traceback (most recent call last): ExecuteError: 执行失败。参数无效。 WARNING 000725: 输出表: 数据集 in_memory\summary_table 已存在。 ERROR 000728: 表中不存在字段 "数据" 执行(Statistics)失败。 执行(ccc)失败。请改正代码
最新发布
05-24
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值