1: The Dataset
In this project, you'll be working with the CIA World Factbook, a compendium of country facts. The Factbook contains demographic information for each country in the world, including:
population
-- the population as of2015
.population_growth
-- the annual population growth rate, as a percentage.area
-- the total land and water area.
You can download the Factbook as a SQLite database here if you want to work with it on your own computer. In this guided project, you'll be working with Python and the SQLite command line tool, to connect to the database, extract data, and perform analysis.
Instructions
For now, just hit "Next" to get started with the project!
2: The SQLite Command Shell
SQLite is a relational database management system that enables you to create databases and query them using SQL syntax. SQLite is simpler than full database systems like MySQL and PostgreSQL. SQLite is good for cases where ease of use is more important than performance. Each SQLite database is stored as a single file, making it easy to transport.
The Factbook database is stored in the filefactbook.db
. db
is a file extension that is short for database.
If you're in the same folder as factbook.db
, and you type sqlite3 factbook.db
on the command line, you'll open the Factbook database in the SQLite Command Shell. This enables you to manage the database and run SQL queries.
Instructions
Use the SQLite Shell to explore factbook.db
.
- Connect to
factbook.db
using the SQLite shell. - Type
.help
into the shell to see a list of commands you can run in the shell. - Type
.tables
to see a list of the tables in the database. -
If you type
.header on
, you'll see the column headers when you run queries.If you type an incomplete command (missing ending
;
for example), your command won't be executed and you'll be instead taken to an indented line. Type;
in the indented line (and press Enter) to exit the indentation and run the command. Here's an example:
When you're done with the Command Shell, you can type .quit
to quit. Don't quit the shell just yet since you'll be using it in the next step
3: Running Queries
The SQLite Command Shell also allows you to run any valid SQL query. For example, you could run the following:
SELECT * FROM facts;
This will show you all the rows in the facts
table.
Instructions
Run some queries in the SQLite Command Shell. Make sure to turn headers on with .header on
to see headers for each column.
You should think of your own queries, but here are some examples:
- SELECT * FROM facts ORDER BY population DESC LIMIT 10;
- SELECT * FROM facts ORDER BY area_land ASC LIMIT 10;
You may notice that these queries will show you some strange results, such as Ethiopia
having the least land area. The queries also include non-national entities like the European Union
and Akrotiri
.
The data is fairly messy, and some values in the area_land
column are missing. Add WHERE area_land != ""
to the query before theORDER BY
clause to remove the invalid rows. You may also need to try additional filtering.
When you're done exploring, you can quit the SQLite Command Shell with .quit
4: Using Python With SQLite
The sqlite3
library, which comes default in Python, allows us to connect to SQLite databases. To do this, we open a database connection, then create an object that can run queries.
For example, this will let us connect tofactbook.db
, and select all of the rows:
import sqlite3
conn = sqlite3.connect('factbook.db')
c = conn.cursor()
c.execute('SELECT * FROM facts;')
print(c.fetchall())
The code above creates a Connection object. We then create a Cursor instance, which can execute queries. Finally, we execute a query and display the results using the print
function. For more depth, read the package documentation here.
Instructions
-
Write code in
query.py
that will select the10
countries with the least population from thefacts
table, and then print the results. -
Execute
query.py
from the command line by runningpython query.py
.
5: Computing Projected Population
You can read the results of a SQL query into a Pandas Dataframe using the read_sql_query
function, which is documented here. Theread_sql_query
function takes a SQL query string and a connection object, and returns a Dataframe containing all the rows and columns from the query.
When Pandas reads in columns this way, it sets the type of the column to be the type of the column in the database. Some columns, likearea_land
, which had blank entries in the database, will have NaN
values in a DataFrame, corresponding to "Not a Number". This is because Pandas can't have blanks in numeric columns, and uses NaN
as a way to signify an invalid or missing value.
See this Pandas documentation on how to work with missing data. The best way to deal with it for now is to just use the dropna
method, with the axis=0
argument, which will drop any rows that have NaN
values in them.
Instructions
Read the facts
table into Pandas, and then compute the projected population for each country in 2050
. Here are the rough steps:
- Create a script called
growth.py
. - Read
facts
into a Pandas Dataframe usingread_sql_query
. - Filter out any rows with invalid data, such as the
area_land
column being0
. - Write a function that takes in the initial population and the growth rate of a country, and outputs the final population. The annual population growth (expressed as a percentage) for each country is in the
population_growth
column. The initial population is in thepopulation
column.- The formula for compound annual population growth is N=N0e(rt)N=N0e(rt), where NN is the final population, N0N0 is the initial population, ee is a constant value you can access with
math.e
, rr is the rate of annual change, expressed as a decimal (so1.5
percent should be.015
), and tt is the number of years to calculate for. Assume that you'll be starting in January2015
, and you'll be ending in January2050
, or35
years. - Let's say you have a country with
5000
people, and a4
percent annual growth rate. The formula would look like N=5000∗e(.04∗35)N=5000∗e(.04∗35).
- The formula for compound annual population growth is N=N0e(rt)N=N0e(rt), where NN is the final population, N0N0 is the initial population, ee is a constant value you can access with
- Use the
apply
method on Pandas Dataframes to compute the population in2050
for each row in the data. - Use the Dataframe sort_values method to sort on the
2050
population in descending order. - Print the
10
countries that will have the highest projected populations in2050
6: Computing Total Area
You can compute totals of a column using theSUM
function in SQL queries. For example, you can select the total population
with:
SELECT SUM(population) from facts;
You can also add a WHERE
clause:
SELECT SUM(population) from facts WHERE
area_land != "";
Instructions
Use SQL and Python to compute the ratio of how much land area countries claim as their territory versus how much water area they claim. Here are the rough steps:
- Make a script called
area.py
. - Query to get the total of the
area_land
column. - Query to get the total of the
area_water
column`. - Divide
area_land
byarea_water
, and print the result
7: Next Steps
That's all for the guided steps, but feel free to keep going through the data and answering questions. We encourage you to think of your own questions, and to be creative in exploring the dataset!
Some interesting questions to get you started:
- Which countries will lose population over the next
35
years? - Which countries have the lowest/highest population density?
- Which countries receive the most immigrants? Which countries lose the most emigrants?
You can write scripts and explore here, or download the code to your computer using the download icon to the right:
We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work -- we'd love to see it!
import pandas as pd import sqlite3 | |
conn = sqlite3.connect('factbook.db') | |
c = conn.cursor() | |
#type my sql query here | |
c.execute('SELECT name FROM facts ORDER BY population LIMIT 10;') | |
print(c.fetchall()) |
import pandas as pd
| ||
conn = sqlite3.connect('factbook.db') | ||
c = conn.cursor() | ||
c.execute('SELECT SUM(area_land) FROM facts WHERE area_land != "";') | ||
d = c.fetchall() | ||
area_land = d[0][0] | ||
c.execute('SELECT SUM(area_water) FROM facts WHERE area_water != "";') | ||
e = c.fetchall() | ||
area_water = e[0][0] | ||
print(area_land/float(area_water)) |
import sqlite3 | |
import math import pandas as pd | |
conn = sqlite3.connect('factbook.db') | |
df = pd.read_sql_query('SELECT * FROM facts', conn) | |
df_clean = df.dropna(axis=0) | |
df_clean = df_clean[df_clean['area_land']>0] | |
growth_rate = df_clean['population_growth'] | |
init_pop = df_clean['population'] | |
def compound_growth_rate(pop, rate): | |
return pop*math.e**((rate/100)*35) | |
df_clean['pop2050'] = compound_growth_rate(init_pop, growth_rate) | |
df_clean_sort_pop2050 = df_clean.sort('pop2050', ascending=False) | |
print(df_clean_sort_pop2050[['name','population','pop2050']][:10]) |