Guided Project: Working With A SQLite Database

1: The Dataset

In this project, you'll be working with the CIA World Factbook, a compendium of country facts. The Factbook contains demographic information for each country in the world, including:

  • population -- the population as of 2015.
  • population_growth -- the annual population growth rate, as a percentage.
  • area -- the total land and water area.

You can download the Factbook as a SQLite database here if you want to work with it on your own computer. In this guided project, you'll be working with Python and the SQLite command line tool, to connect to the database, extract data, and perform analysis.

Instructions

For now, just hit "Next" to get started with the project!

 

2: The SQLite Command Shell

SQLite is a relational database management system that enables you to create databases and query them using SQL syntax. SQLite is simpler than full database systems like MySQL and PostgreSQL. SQLite is good for cases where ease of use is more important than performance. Each SQLite database is stored as a single file, making it easy to transport.

The Factbook database is stored in the filefactbook.dbdb is a file extension that is short for database.

If you're in the same folder as factbook.db, and you type sqlite3 factbook.db on the command line, you'll open the Factbook database in the SQLite Command Shell. This enables you to manage the database and run SQL queries.

Instructions

Use the SQLite Shell to explore factbook.db.

  • Connect to factbook.db using the SQLite shell.
  • Type .help into the shell to see a list of commands you can run in the shell.
  • Type .tables to see a list of the tables in the database.
  • If you type .header on, you'll see the column headers when you run queries.

    If you type an incomplete command (missing ending ; for example), your command won't be executed and you'll be instead taken to an indented line. Type ; in the indented line (and press Enter) to exit the indentation and run the command. Here's an example:

Imgur

When you're done with the Command Shell, you can type .quit to quit. Don't quit the shell just yet since you'll be using it in the next step

 

3: Running Queries

The SQLite Command Shell also allows you to run any valid SQL query. For example, you could run the following:

 

SELECT * FROM facts;

This will show you all the rows in the factstable.

Instructions

Run some queries in the SQLite Command Shell. Make sure to turn headers on with .header on to see headers for each column.

You should think of your own queries, but here are some examples:

  • SELECT * FROM facts ORDER BY population DESC LIMIT 10;
  • SELECT * FROM facts ORDER BY area_land ASC LIMIT 10;

You may notice that these queries will show you some strange results, such as Ethiopia having the least land area. The queries also include non-national entities like the European Union and Akrotiri.

The data is fairly messy, and some values in the area_land column are missing. Add WHERE area_land != "" to the query before theORDER BY clause to remove the invalid rows. You may also need to try additional filtering.

When you're done exploring, you can quit the SQLite Command Shell with .quit

4: Using Python With SQLite

The sqlite3 library, which comes default in Python, allows us to connect to SQLite databases. To do this, we open a database connection, then create an object that can run queries.

For example, this will let us connect tofactbook.db, and select all of the rows:

 

import sqlite3

conn = sqlite3.connect('factbook.db')

c = conn.cursor()

c.execute('SELECT * FROM facts;')

print(c.fetchall())

The code above creates a Connection object. We then create a Cursor instance, which can execute queries. Finally, we execute a query and display the results using the print function. For more depth, read the package documentation here.

Instructions

  • Write code in query.py that will select the 10 countries with the least population from the facts table, and then print the results.

  • Execute query.py from the command line by running python query.py.

 

5: Computing Projected Population

You can read the results of a SQL query into a Pandas Dataframe using the read_sql_queryfunction, which is documented here. Theread_sql_query function takes a SQL query string and a connection object, and returns a Dataframe containing all the rows and columns from the query.

When Pandas reads in columns this way, it sets the type of the column to be the type of the column in the database. Some columns, likearea_land, which had blank entries in the database, will have NaN values in a DataFrame, corresponding to "Not a Number". This is because Pandas can't have blanks in numeric columns, and uses NaN as a way to signify an invalid or missing value.

See this Pandas documentation on how to work with missing data. The best way to deal with it for now is to just use the dropna method, with the axis=0 argument, which will drop any rows that have NaN values in them.

Instructions

Read the facts table into Pandas, and then compute the projected population for each country in 2050. Here are the rough steps:

  • Create a script called growth.py.
  • Read facts into a Pandas Dataframe using read_sql_query.
  • Filter out any rows with invalid data, such as the area_landcolumn being 0.
  • Write a function that takes in the initial population and the growth rate of a country, and outputs the final population. The annual population growth (expressed as a percentage) for each country is in the population_growth column. The initial population is in the population column.
    • The formula for compound annual population growth is N=N0e(rt)N=N0e(rt), where NN is the final population, N0N0 is the initial population, ee is a constant value you can access with math.e, rr is the rate of annual change, expressed as a decimal (so 1.5 percent should be .015), and tt is the number of years to calculate for. Assume that you'll be starting in January 2015, and you'll be ending in January2050, or 35 years.
    • Let's say you have a country with 5000 people, and a 4percent annual growth rate. The formula would look like N=5000∗e(.04∗35)N=5000∗e(.04∗35).
  • Use the apply method on Pandas Dataframes to compute the population in 2050 for each row in the data.
  • Use the Dataframe sort_values method to sort on the 2050population in descending order.
  • Print the 10 countries that will have the highest projected populations in 2050

6: Computing Total Area

You can compute totals of a column using theSUM function in SQL queries. For example, you can select the total population with:

 

SELECT SUM(population) from facts;

You can also add a WHERE clause:

 

SELECT SUM(population) from facts WHERE

    area_land != "";

Instructions

Use SQL and Python to compute the ratio of how much land area countries claim as their territory versus how much water area they claim. Here are the rough steps:

  • Make a script called area.py.
  • Query to get the total of the area_land column.
  • Query to get the total of the area_water column`.
  • Divide area_land by area_water, and print the result

7: Next Steps

That's all for the guided steps, but feel free to keep going through the data and answering questions. We encourage you to think of your own questions, and to be creative in exploring the dataset!

Some interesting questions to get you started:

  • Which countries will lose population over the next 35 years?
  • Which countries have the lowest/highest population density?
  • Which countries receive the most immigrants? Which countries lose the most emigrants?

You can write scripts and explore here, or download the code to your computer using the download icon to the right:

Imgur

We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work -- we'd love to see it!

 
 

import pandas as pd

import sqlite3

  
 conn = sqlite3.connect('factbook.db')
  
 c = conn.cursor()
 #type my sql query here
 c.execute('SELECT name FROM facts ORDER BY population LIMIT 10;')
 print(c.fetchall())
 
 

import pandas as pd

import sqlite3
  
 conn = sqlite3.connect('factbook.db')
 c = conn.cursor()
  
 c.execute('SELECT SUM(area_land) FROM facts WHERE area_land != "";')
 d = c.fetchall()
 area_land = d[0][0]
  
 c.execute('SELECT SUM(area_water) FROM facts WHERE area_water != "";')
 e = c.fetchall()
 area_water = e[0][0]
  
 print(area_land/float(area_water))
 
 import sqlite3
 

import math

import pandas as pd

  
 conn = sqlite3.connect('factbook.db')
  
 df = pd.read_sql_query('SELECT * FROM facts', conn)
 df_clean = df.dropna(axis=0)
 df_clean = df_clean[df_clean['area_land']>0]
  
 growth_rate = df_clean['population_growth']
 init_pop = df_clean['population']
  
 def compound_growth_rate(pop, rate):
 return pop*math.e**((rate/100)*35)
  
 df_clean['pop2050'] = compound_growth_rate(init_pop, growth_rate)
  
 df_clean_sort_pop2050 = df_clean.sort('pop2050', ascending=False)
 print(df_clean_sort_pop2050[['name','population','pop2050']][:10])

 

 

转载于:https://my.oschina.net/Bettyty/blog/747365

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值