Introduction to SQL

1: Databases

In previous missions, we've worked extensively with datasets that are stored in a single file, usually a CSV file. While CSV files are easy to interface with, they have a lot of limitations. As the data gets larger, it becomes more difficult to load the file into a computer's memory, which is how tools like Pandas work with data. CSV files also fall short at providing strict security for production applications (imagine if companies like Google or Facebook used CSV files to store and access data) and are optimized for static representation. If your data changes quickly, which is true for most technology companies, then you'll need to adopt a different method.

A database is a data store designed for storing, querying, and processing data. Databases store the data we want and expose an interface for interacting with that data. Most technology companies use databases to structure the data coming into the system and later query specific subsets of the data to answer questions or update existing data. Database systems also come with database management software with administrative controls, security and access controls, and a language to interface with the database.

In this course, we'll be focusing on a language called SQL, or Structured Query Language, which was designed to query, update, and modify data stored in a database.

SQL is the most common language for working with databases and is an important tool in any data professional's toolkit. While SQL is a language, it's quite different from languages like Python or R. SQL was built specifically for querying and interacting with databases and won't have much of the functionality you can expect in traditional programming languages. Since SQL is a declarative language, the user focuses on expressing what he or she wants and the computer focuses on figuring out how to perform the computation.

Before diving into SQL syntax, we'll introduce a few database concepts so you're aware of how the data is represented in a database and why SQL makes it easy to work with that data.

2: Tables, Rows, & Columns

A database is a collection of tables, where each table is made up of rows of data and each row has values for the same set of columnsacross the table. A table is very similar to a DataFrame in Pandas or how a regular CSV file is structured. Both have rows of values with a consistent set of columns.

We'll be working with the data from the American Community Survey on college majors and job outcomes. Here's a preview of recent-grads.csv, the dataset we'll be working with:

RankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployed
12419PETROLEUM ENGINEERINGEngineering23393620572820.1205641976
22416MINING AND MINERAL ENGINEERINGEngineering7567679770.101852640
32415METALLURGICAL ENGINEERINGEngineering85637251310.153037648
42417NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering12581611231350.107313758
52405CHEMICAL ENGINEERINGEngineering3226028921239110210.34163125694

We have loaded the data in for years 2010-2012 for just recent college grads into a table named recent_grads in a database so we can explore how to query the data using SQL. You'll notice that the table contains the same columns for each row of data, with each row representing a major in college.

The full table has many more columns, 21 to be specific, than the ones displayed above and they're explained in further detail onFiveThirtyEight's Github repo.

Here are the descriptions of the columns in the above snapshot:

  • Rank - Rank by median earnings.
  • Major_code - Major code.
  • Major - Major description.
  • Major_category - Category of major.
  • Total - Total number of people with major.
  • Sample_size - Sample size (unweighted) of full-time.
  • Men - Male graduates.
  • Women - Female graduates.
  • ShareWomen - Women as share of total.
  • Employed - Number employed.

Let's dive into how to use SQL to query this database, which contains just this one table.

3: Querying

Writing a SQL query is the primary way of interacting with a database. A SQL query has to adhere to a defined structure and vocabulary that we use to define what we want the database to do. The SQL language has a set of general statements that you combine with specific logic to express the intent of that query.

The first and most basic statement in SQL is a SELECT statement. To specify that we want 10 specific columns for all of the rows from a specific table, we use the SELECT keyword along with the names of the 10 columns we want the database to return. You use a SELECTstatement whenever you want to return specific data from the database without editing or modifying the values in the database.

Let's explore the basic syntax for the SELECT statement.

 

SELECT [columnA, columnB, ...]

FROM tableName;

The SQL syntax reads more like English than a programming language like Python. The database converts your query to lower-level logic and returns the results back to you. Let's see what an actual SQL query looks like now. The following query selects the Rank and Majorcolumns from the table recent_grads, which represents the information from recent-grads.csv as a table in the database:

 

SELECT Rank,Major

FROM recent_grads;

The semi-colon (;) at the end of the query is required since it specifies where the query ends. This allow us to write a query either in one line or over multiple lines.

4: SQLite

We'll be working with SQLite, a lightweight database that's ideal for exploring and learning SQL. We'll dive more into how SQLite specifically works in a later mission, but for now we have taken care of setting up and loading the data into the database on our end.

Writing and running SQL queries in our interface is similar to writing and running Python code. Write the query in the code cell and then click check to execute the query against the database. The results are returned as a list of lists, where each inner list represents the values in a row. If you write multiple queries in a code cell, only the last query's results will be displayed.

Here's a preview of the results that SQLite returns:

 

[[1, "PETROLEUM ENGINEERING"], [2, "MINING AND MINERAL ENGINEERING"], [3, "METALLURGICAL ENGINEERING"], [4, "NAVAL ARCHITECTURE AND

    MARINE ENGINEERING"], [5, "CHEMICAL ENGINEERING"],...

Let's now practice writing and running a SQL query.

Instructions

  • Use the query from the previous step that returns the Rank andMajor columns from the tablerecent_grads.
  • Click check to see the results of the query

转载于:https://my.oschina.net/Bettyty/blog/747316

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值