1: Databases
In previous missions, we've worked extensively with datasets that are stored in a single file, usually a CSV file. While CSV files are easy to interface with, they have a lot of limitations. As the data gets larger, it becomes more difficult to load the file into a computer's memory, which is how tools like Pandas work with data. CSV files also fall short at providing strict security for production applications (imagine if companies like Google or Facebook used CSV files to store and access data) and are optimized for static representation. If your data changes quickly, which is true for most technology companies, then you'll need to adopt a different method.
A database is a data store designed for storing, querying, and processing data. Databases store the data we want and expose an interface for interacting with that data. Most technology companies use databases to structure the data coming into the system and later query specific subsets of the data to answer questions or update existing data. Database systems also come with database management software with administrative controls, security and access controls, and a language to interface with the database.
In this course, we'll be focusing on a language called SQL, or Structured Query Language, which was designed to query, update, and modify data stored in a database.
SQL is the most common language for working with databases and is an important tool in any data professional's toolkit. While SQL is a language, it's quite different from languages like Python or R. SQL was built specifically for querying and interacting with databases and won't have much of the functionality you can expect in traditional programming languages. Since SQL is a declarative language, the user focuses on expressing what he or she wants and the computer focuses on figuring out how to perform the computation.
Before diving into SQL syntax, we'll introduce a few database concepts so you're aware of how the data is represented in a database and why SQL makes it easy to work with that data.
2: Tables, Rows, & Columns
A database is a collection of tables, where each table is made up of rows of data and each row has values for the same set of columnsacross the table. A table is very similar to a DataFrame in Pandas or how a regular CSV file is structured. Both have rows of values with a consistent set of columns.
We'll be working with the data from the American Community Survey on college majors and job outcomes. Here's a preview of recent-grads.csv
, the dataset we'll be working with:
Rank | Major_code | Major | Major_category | Total | Sample_size | Men | Women | ShareWomen | Employed |
---|---|---|---|---|---|---|---|---|---|
1 | 2419 | PETROLEUM ENGINEERING | Engineering | 2339 | 36 | 2057 | 282 | 0.120564 | 1976 |
2 | 2416 | MINING AND MINERAL ENGINEERING | Engineering | 756 | 7 | 679 | 77 | 0.101852 | 640 |
3 | 2415 | METALLURGICAL ENGINEERING | Engineering | 856 | 3 | 725 | 131 | 0.153037 | 648 |
4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering | 1258 | 16 | 1123 | 135 | 0.107313 | 758 |
5 | 2405 | CHEMICAL ENGINEERING | Engineering | 32260 | 289 | 21239 | 11021 | 0.341631 | 25694 |
We have loaded the data in for years 2010-2012 for just recent college grads into a table named recent_grads
in a database so we can explore how to query the data using SQL. You'll notice that the table contains the same columns for each row of data, with each row representing a major in college.
The full table has many more columns, 21 to be specific, than the ones displayed above and they're explained in further detail onFiveThirtyEight's Github repo.
Here are the descriptions of the columns in the above snapshot:
Rank
- Rank by median earnings.Major_code
- Major code.Major
- Major description.Major_category
- Category of major.Total
- Total number of people with major.Sample_size
- Sample size (unweighted) of full-time.Men
- Male graduates.Women
- Female graduates.ShareWomen
- Women as share of total.Employed
- Number employed.
Let's dive into how to use SQL to query this database, which contains just this one table.
3: Querying
Writing a SQL query is the primary way of interacting with a database. A SQL query has to adhere to a defined structure and vocabulary that we use to define what we want the database to do. The SQL language has a set of general statements that you combine with specific logic to express the intent of that query.
The first and most basic statement in SQL is a SELECT
statement. To specify that we want 10 specific columns for all of the rows from a specific table, we use the SELECT
keyword along with the names of the 10 columns we want the database to return. You use a SELECT
statement whenever you want to return specific data from the database without editing or modifying the values in the database.
Let's explore the basic syntax for the SELECT
statement.
SELECT [columnA, columnB, ...]
FROM tableName;
The SQL syntax reads more like English than a programming language like Python. The database converts your query to lower-level logic and returns the results back to you. Let's see what an actual SQL query looks like now. The following query selects the Rank
and Major
columns from the table recent_grads
, which represents the information from recent-grads.csv
as a table in the database:
SELECT Rank,Major
FROM recent_grads;
The semi-colon (;
) at the end of the query is required since it specifies where the query ends. This allow us to write a query either in one line or over multiple lines.
4: SQLite
We'll be working with SQLite, a lightweight database that's ideal for exploring and learning SQL. We'll dive more into how SQLite specifically works in a later mission, but for now we have taken care of setting up and loading the data into the database on our end.
Writing and running SQL queries in our interface is similar to writing and running Python code. Write the query in the code cell and then click check to execute the query against the database. The results are returned as a list of lists, where each inner list represents the values in a row. If you write multiple queries in a code cell, only the last query's results will be displayed.
Here's a preview of the results that SQLite returns:
[[1, "PETROLEUM ENGINEERING"], [2, "MINING AND MINERAL ENGINEERING"], [3, "METALLURGICAL ENGINEERING"], [4, "NAVAL ARCHITECTURE AND
MARINE ENGINEERING"], [5, "CHEMICAL ENGINEERING"],...
Let's now practice writing and running a SQL query.
Instructions
- Use the query from the previous step that returns the
Rank
andMajor
columns from the tablerecent_grads
. - Click check to see the results of the query