Introduction to SQL

1: Databases

In previous missions, we've worked extensively with datasets that are stored in a single file, usually a CSV file. While CSV files are easy to interface with, they have a lot of limitations. As the data gets larger, it becomes more difficult to load the file into a computer's memory, which is how tools like Pandas work with data. CSV files also fall short at providing strict security for production applications (imagine if companies like Google or Facebook used CSV files to store and access data) and are optimized for static representation. If your data changes quickly, which is true for most technology companies, then you'll need to adopt a different method.

A database is a data store designed for storing, querying, and processing data. Databases store the data we want and expose an interface for interacting with that data. Most technology companies use databases to structure the data coming into the system and later query specific subsets of the data to answer questions or update existing data. Database systems also come with database management software with administrative controls, security and access controls, and a language to interface with the database.

In this course, we'll be focusing on a language called SQL, or Structured Query Language, which was designed to query, update, and modify data stored in a database.

SQL is the most common language for working with databases and is an important tool in any data professional's toolkit. While SQL is a language, it's quite different from languages like Python or R. SQL was built specifically for querying and interacting with databases and won't have much of the functionality you can expect in traditional programming languages. Since SQL is a declarative language, the user focuses on expressing what he or she wants and the computer focuses on figuring out how to perform the computation.

Before diving into SQL syntax, we'll introduce a few database concepts so you're aware of how the data is represented in a database and why SQL makes it easy to work with that data.

2: Tables, Rows, & Columns

A database is a collection of tables, where each table is made up of rows of data and each row has values for the same set of columnsacross the table. A table is very similar to a DataFrame in Pandas or how a regular CSV file is structured. Both have rows of values with a consistent set of columns.

We'll be working with the data from the American Community Survey on college majors and job outcomes. Here's a preview of recent-grads.csv, the dataset we'll be working with:

RankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployed
12419PETROLEUM ENGINEERINGEngineering23393620572820.1205641976
22416MINING AND MINERAL ENGINEERINGEngineering7567679770.101852640
32415METALLURGICAL ENGINEERINGEngineering85637251310.153037648
42417NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering12581611231350.107313758
52405CHEMICAL ENGINEERINGEngineering3226028921239110210.34163125694

We have loaded the data in for years 2010-2012 for just recent college grads into a table named recent_grads in a database so we can explore how to query the data using SQL. You'll notice that the table contains the same columns for each row of data, with each row representing a major in college.

The full table has many more columns, 21 to be specific, than the ones displayed above and they're explained in further detail onFiveThirtyEight's Github repo.

Here are the descriptions of the columns in the above snapshot:

  • Rank - Rank by median earnings.
  • Major_code - Major code.
  • Major - Major description.
  • Major_category - Category of major.
  • Total - Total number of people with major.
  • Sample_size - Sample size (unweighted) of full-time.
  • Men - Male graduates.
  • Women - Female graduates.
  • ShareWomen - Women as share of total.
  • Employed - Number employed.

Let's dive into how to use SQL to query this database, which contains just this one table.

3: Querying

Writing a SQL query is the primary way of interacting with a database. A SQL query has to adhere to a defined structure and vocabulary that we use to define what we want the database to do. The SQL language has a set of general statements that you combine with specific logic to express the intent of that query.

The first and most basic statement in SQL is a SELECT statement. To specify that we want 10 specific columns for all of the rows from a specific table, we use the SELECT keyword along with the names of the 10 columns we want the database to return. You use a SELECTstatement whenever you want to return specific data from the database without editing or modifying the values in the database.

Let's explore the basic syntax for the SELECT statement.

 

SELECT [columnA, columnB, ...]

FROM tableName;

The SQL syntax reads more like English than a programming language like Python. The database converts your query to lower-level logic and returns the results back to you. Let's see what an actual SQL query looks like now. The following query selects the Rank and Majorcolumns from the table recent_grads, which represents the information from recent-grads.csv as a table in the database:

 

SELECT Rank,Major

FROM recent_grads;

The semi-colon (;) at the end of the query is required since it specifies where the query ends. This allow us to write a query either in one line or over multiple lines.

4: SQLite

We'll be working with SQLite, a lightweight database that's ideal for exploring and learning SQL. We'll dive more into how SQLite specifically works in a later mission, but for now we have taken care of setting up and loading the data into the database on our end.

Writing and running SQL queries in our interface is similar to writing and running Python code. Write the query in the code cell and then click check to execute the query against the database. The results are returned as a list of lists, where each inner list represents the values in a row. If you write multiple queries in a code cell, only the last query's results will be displayed.

Here's a preview of the results that SQLite returns:

 

[[1, "PETROLEUM ENGINEERING"], [2, "MINING AND MINERAL ENGINEERING"], [3, "METALLURGICAL ENGINEERING"], [4, "NAVAL ARCHITECTURE AND

    MARINE ENGINEERING"], [5, "CHEMICAL ENGINEERING"],...

Let's now practice writing and running a SQL query.

Instructions

  • Use the query from the previous step that returns the Rank andMajor columns from the tablerecent_grads.
  • Click check to see the results of the query

SELECT Rank,Major FROM recent_grads;

 

5: Specifying Column Order

SQL allows us to specify the order of columns in the returned results in the SELECT statement. Try swapping the order of the columns we specified in the previous query and click check to see the results.

Instructions

Modify the SQL query from the previous step so:

  • The Major value for each row is first,
  • The Rank value for each row is second

SELECT Major,Rank FROM recent_grads;

 

6: Practice: Select

When we used Major,Rank instead of Rank,Major in the SELECT statement from the previous step, you'll notice that the first value in each list was the major while the second value was the rank.

Now it's your turn to write a SQL query from scratch.

Instructions

Write a query that returns the following 5 columns in the order specified fromrecent_grads:

  • Rank
  • Major_code
  • Major
  • Major_category
  • Total

SELECT Rank,Major_code,Major,Major_category,Total FROM recent_grads;

7: Where

So far, we've been writing queries that return every row from the table but constrained to specific columns. If we wanted to figure out which majors had more female graduates than male graduates (when ShareWomen is larger than 0.5), we need a way to constrain the rows that are returned.

To filter rows by specific criteria, we need to use the WHERE statement. The WHERE statement requires 3 things:

  • The column we want the database to filter on: ShareWomen
  • A comparison operator to specify how we want a value in a column to be compared: >
  • The comparison value we want the database to compare each value to: 0.5

In the below query, we:

  • Use SELECT to specify the column filtering criteria: Major and ShareWomen
  • Use FROM to specify the table we want to query: recent_grads
  • Use WHERE to specify the row filtering criteria: ShareWomen > 0.5

SELECT Major,ShareWomen

FROM recent_grads

WHERE ShareWomen > 0.5;

Here are the comparison operators we can use:

  • Less than: <
  • Less than or equal to: <=
  • Greater than: >
  • Greater than or equal to: >=
  • Equal to: =
  • Not equal to: !=

The comparison value after the operator must either be text or a number depending on the field. ShareWomen is a numeric column, we don't need to wrap the number 0.5 with quotes. Lastly, most database systems require that the SELECT and FROM statements come first before any WHERE or other statements.

Instructions

Run the query that we explored above that returns the Major andShareWomen values for all rows whereShareWomen exceeded 0.5.

  • Ensure that all of the values forShareWomen (the second value in each inner list) are greater than0.5

SELECT Major,ShareWomen FROM recent_grads WHERE ShareWomen > 0.5;

 

8: Practice: Where

Now it's your turn to writing a SQL query that uses the WHERE statement to filter the results.

Instructions

Write a SQL query that returns all majors that have more than 10000people employed with that background.

  • In the SELECT statement, specify that we only want the values from the Major and Employedcolumns (in that order)

SELECT Major,Employed FROM recent_grads WHERE Employed >10000;

9: Limit

Many queries return a large number of results, which can be cumbersome to work with. SQL comes with a statement called LIMIT that allows us to specify how many results we'd like the database to return.

To use the LIMIT statement, we need to specify the number of results that are returned as an integer value. The following query returns the first 5 values in the Major column:

SELECT Major FROM recent_grads LIMIT 5;

Here's the result of that query:

[["PETROLEUM ENGINEERING"], ["MINING AND MINERAL ENGINEERING"], ["METALLURGICAL ENGINEERING"], ["NAVAL ARCHITECTURE AND MARINE

    ENGINEERING"], ["CHEMICAL ENGINEERING"]]

Instructions

Write a query that returns:

  • the Major column
  • where Employed exceeds 10000
  • only the first 10 results

SELECT Major FROM recent_grads WHERE Employed >10000 LIMIT 10;

 

 

10: Next Steps

We covered the basics of databases and SQL syntax in this lesson and have seen that SQL is an expressive language for working with data. In the next lesson, we'll learn about how to combine multiple filtering criteria together to express more complex logic in SQL

转载于:https://my.oschina.net/Bettyty/blog/747332

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值