DataCamp课程:Joining Data in SQL

Joining Data in SQL

1. Introduction to joins

</> Inner join

Begin by selecting all columns from the cities table.

SELECT *
FROM cities;

name		country_code	city_proper_pop	metroarea_pop	urbanarea_pop
Abidjan		CIV				4765000			null			4765000
Abu Dhabi	ARE				1145000			null			1145000
Abuja		NGA				1235880			6000000			1235880
...

Inner join the cities table on the left to the countries table on the right, keeping all of the fields in both tables.

You should match the tables on the country_code field in cities and the code field in countries.

Do not alias your tables here or in the next step. Using cities and countries is fine for now.

SELECT * 
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code;

name		country_code	city_proper_pop	metroarea_pop	urbanarea_pop	code	name					continent	region			surface_area	indep_year	local_name							gov_form			capital			cap_long	cap_lat
Abidjan		CIV				4765000			null			4765000			CIV		Cote d'Ivoire			Africa		Western Africa	322463			1960		Cote d’Ivoire						Republic			Yamoussoukro	-4.0305		5.332
Abu Dhabi	ARE				1145000			null			1145000			ARE		United Arab Emirates	Asia		Middle East		83600			1971		Al-Imarat al-´Arabiya al-Muttahida	Emirate Federation	Abu Dhabi		54.3705		24.4764
Abuja		NGA				1235880			6000000			1235880			NGA		Nigeria					Africa		Western Africa	923768			1960		Nigeria								Federal Republic	Abuja			7.48906		9.05804
...

Modify the SELECT statement to keep only the name of the city, the name of the country, and the name of the region the country resides in.

Recall from our Intro to SQL for Data Science course that you can alias fields using AS. Alias the name of the city AS city and the name of the country AS country.

SELECT cities.name AS city, 
countries.name AS country, 
countries.region
FROM cities
INNER JOIN countries
ON cities.country_code = countries.code;

city		country					region
Abidjan		Cote d'Ivoire			Western Africa
Abu Dhabi	United Arab Emirates	Middle East
Abuja		Nigeria					Western Africa
...

Join the tables countries (left) and economies (right) aliasing countries AS c and economies AS e.

Specify the field to match the tables ON.

From this join, SELECT:

  • c.code, aliased as country_code.
  • name, year, and inflation_rate, not aliased
SELECT c.code AS country_code, c.name, e.year, e.inflation_rate
FROM countries AS c
INNER JOIN economies AS e
ON c.code = e.code;

country_code	name		year	inflation_rate
AFG				Afghanistan	2010	2.179
AFG				Afghanistan	2015	-1.549
AGO				Angola		2010	14.48
...

Inner join countries (left) and populations (right) on the code and country_code fields respectively.

Alias countries AS c and populations AS p.

Select code, name, and region from countries and also select year and fertility_rate from populations (5 fields in total).

SELECT c.code, c.name, c.region, p.year, p.fertility_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code=p.country_code;

code	name		region						year	fertility_rate
ABW		Aruba		Caribbean					2010	1.704
ABW		Aruba		Caribbean					2015	1.647
AFG		Afghanistan	Southern and Central Asia	2010	5.746
...

Add an additional inner join with economies to your previous query by joining on code.

Include the unemployment_rate column that became available through joining with economies.

Note that year appears in both populations and economies, so you have to explicitly use e.year instead of year as you did before.

SELECT c.code, name, region, e.year, fertility_rate, unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
INNER JOIN economies AS e
ON c.code = e.code;

code	name		region						year	fertility_rate	unemployment_rate
AFG		Afghanistan	Southern and Central Asia	2010	4.653			null
AFG		Afghanistan	Southern and Central Asia	2010	5.746			null
AFG		Afghanistan	Southern and Central Asia	2015	4.653			null
...

Scroll down the query result and take a look at the results for Albania from your previous query. Does something seem off to you?

The trouble with doing your last join on c.code = e.code and not also including year is that e.g. the 2010 value for fertility_rate is also paired with the 2015 value for unemployment_rate.

Fix your previous query: in your last ON clause, use AND to add an additional joining condition. In addition to joining on code in c and e, also join on year in e and p.

SELECT c.code, name, region, e.year, fertility_rate, unemployment_rate
FROM countries AS c
INNER JOIN populations AS p
ON c.code = p.country_code
INNER JOIN economies AS e
ON c.code = e.code AND p.year = e.year;

code	name		region						year	fertility_rate	unemployment_rate
AFG		Afghanistan	Southern and Central Asia	2010	5.746			null
AFG		Afghanistan	Southern and Central Asia	2015	4.653			null
AGO		Angola		Central Africa				2010	6.416			null
...

</> Review inner join using on

Why does the following code result in an error?

SELECT c.name AS country, l.name AS language
FROM countries AS c
  INNER JOIN languages AS l;
  • The languages table has more rows than the countries table.
  • There are multiple languages spoken in many countries.
  • INNER JOIN requires a specification of the key field (or fields) in each table.
  • Join queries may not be followed by a semi-colon.

</> Inner join with using

Inner join countries on the left and languages on the right with USING(code).

Select the fields corresponding to:

  • country name AS country,
  • continent name,
  • language name AS language, and
  • whether or not the language is official.

whether or not the language is official.

SELECT c.name AS country, continent, l.name AS language, official
FROM countries AS c
INNER JOIN languages AS l
USING(code);

country	continent	language	official
Afghanistan	Asia	Dari		true
Afghanistan	Asia	Pashto		true
Afghanistan	Asia	Turkic		false
...

</> Self-join

Join populations with itself ON country_code.

Select the country_code from p1 and the size field from both p1 and p2. SQL won’t allow same-named fields, so alias p1.size as size2010 and p2.size as size2015.

SELECT p1.country_code,
	   p1.size AS size2010,
	   p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code;

country_code	size2010	size2015
ABW				101597		103889
ABW				101597		101597
ABW				103889		103889
...

Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.

  • Extend the ON in your query to include only those records where the p1.year (2010) matches with p2.year - 5 (2015 - 5 = 2010). This will omit the three entries per country_code that you aren’t interested in.
SELECT p1.country_code,
       p1.size AS size2010,
       p2.size AS size2015
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
	AND p1.year = p2.year - 5;

country_code	size2010	size2015
ABW				101597		103889
AFG				27962200	32526600
AGO				21220000	25022000
...

As you just saw, you can also use SQL to calculate values like p2.year - 5 for you. With two fields like size2010 and size2015, you may want to determine the percentage increase from one field to the next:

With two numeric fields A and B, the percentage growth from A to B can be calculated as (B - A) / A * 100.0.

Add a new field to SELECT, aliased as growth_perc, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size and p1.size.

SELECT p1.country_code,
       p1.size AS size2010, 
       p2.size AS size2015,
       ((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
FROM populations AS p1
INNER JOIN populations AS p2
ON p1.country_code = p2.country_code
	AND p1.year = p2.year - 5;

country_code	size2010	size2015	growth_perc
ABW				101597		103889		2.25597210228443
AFG				27962200	32526600	16.32329672575
AGO				21220000	25022000	17.9171919822693
...

</> Case when and then

Using the countries table, create a new field AS geosize_group that groups the countries into three groups:

  • If surface_area is greater than 2 million, geosize_group is ‘large’.
  • If surface_area is greater than 350 thousand but not larger than 2 million, geosize_group is ‘medium’.
  • Otherwise, geosize_group is ‘small’.
SELECT name, continent, code, surface_area,
CASE WHEN surface_area > 2000000 THEN 'large'
     WHEN surface_area > 350000 THEN 'medium'
     ELSE 'small' END
AS geosize_group
FROM countries;

name		continent	code	surface_area	geosize_group
Afghanistan	Asia		AFG		652090			medium
Netherlands	Europe		NLD		41526			small
Albania		Europe		ALB		28748			small
...

</> Inner challenge

Using the populations table focused only for the year 2015, create a new field aliased as popsize_group to organize population size into

  • ’large’ (> 50 million),
  • ’medium’ (> 1 million), and
  • ’small’ groups.

Select only the country code, population size, and this new popsize_group as fields.

SELECT country_code, size,
    CASE WHEN size > 50000000 THEN 'large'
        WHEN size > 1000000 THEN 'medium'
        ELSE 'small' END
        AS popsize_group
FROM populations
WHERE year = 2015;

country_code	size		popsize_group
ABW				103889		small
AFG				32526600	medium
AGO				25022000	medium
...

Use INTO to save the result of the previous query as pop_plus. You can see an example of this in the countries_plus code in the assignment text. Make sure to include a ; at the end of your WHERE clause!

Then, include another query below your first query to display all the records in pop_plus using SELECT * FROM pop_plus; so that you generate results and this will display pop_plus in query result.

SELECT country_code, size,
    CASE WHEN size > 50000000 THEN 'large'
        WHEN size > 1000000 THEN 'medium'
        ELSE 'small' END
        AS popsize_group
INTO pop_plus
FROM populations
WHERE year = 2015;

SELECT * FROM pop_plus;

country_code	size		popsize_group
ABW				103889		small
AFG				32526600	medium
AGO				25022000	medium
...

Keep the first query intact that creates pop_plus using INTO.

Write a query to join countries_plus AS c on the left with pop_plus AS p on the right matching on the country code fields.

Sort the data based on geosize_group, in ascending order so that large appears on top.

Select the name, continent, geosize_group, and popsize_group fields.

SELECT country_code, size,
  CASE WHEN size > 50000000
            THEN 'large'
       WHEN size > 1000000
            THEN 'medium'
       ELSE 'small' END
       AS popsize_group
INTO pop_plus       
FROM populations
WHERE year = 2015;

SELECT name, continent, geosize_group, popsize_group
FROM countries_plus AS c
  INNER JOIN pop_plus AS p
    ON c.code = p.country_code 
ORDER BY geosize_group;

name			continent		geosize_group	popsize_group
Canada			North America	large			medium
United States	North America	large			large
Greenland		North America	large			small
...

2. Outer joins and cross joins

</> Left Join

Fill in the code based on the instructions in the code comments to complete the inner join. Note how many records are in the result of the join in the query result tab.

SELECT c1.name AS city, code, c2.name AS country,
       region, city_proper_pop
FROM cities AS c1
  INNER JOIN countries AS c2
    ON c1.country_code = c2.code
ORDER BY code DESC;

city		code	country			region			city_proper_pop
Harare		ZWE		Zimbabwe		Eastern Africa	1606000
Lusaka		ZMB		Zambia			Eastern Africa	1742980
Cape Town	ZAF		South Africa	Southern Africa	3740030
...
[Showing 100 out of 230 rows]

Change the code to perform a LEFT JOIN instead of an INNER JOIN. After executing this query, note how many records the query result contains.

SELECT c1.name AS city, code, c2.name AS country,
       region, city_proper_pop
FROM cities AS c1
  LEFT JOIN countries AS c2
    ON c1.country_code = c2.code
ORDER BY code DESC;

city		code	country	region	city_proper_pop
Taichung	null	null	null	2752410
Tainan		null	null	null	1885250
Kaohsiung	null	null	null	2778920
...
[Showing 100 out of 236 rows]

Perform an inner join. Alias the name of the country field as country and the name of the language field as language.

Sort based on descending country name.

SELECT c.name AS country, local_name, l.name AS language, percent
FROM countries AS c
  INNER JOIN languages AS l
    ON c.code = l.code
ORDER BY country DESC;

country		local_name	language	percent
Zimbabwe	Zimbabwe	Shona		null
Zimbabwe	Zimbabwe	Venda		null
Zimbabwe	Zimbabwe	Tswana		null
...
[Showing 100 out of 909 rows]

Perform a left join instead of an inner join. Observe the result, and also note the change in the number of records in the result.

Carefully review which records appear in the left join result, but not in the inner join result.

SELECT c.name AS country, local_name, l.name AS language, percent
FROM countries AS c
  LEFT JOIN languages AS l
    ON c.code = l.code
ORDER BY country DESC;

country		local_name	language	percent
Zimbabwe	Zimbabwe	Chibarwe	null
Zimbabwe	Zimbabwe	Shona		null
Zimbabwe	Zimbabwe	Ndebele		null
...
[Showing 100 out of 916 rows]

Begin with a left join with the countries table on the left and the economies table on the right.

Focus only on records with 2010 as the year.

SELECT name, region, gdp_percapita
FROM countries AS c
  LEFT JOIN economies AS e
    ON c.code = e.code
WHERE year = 2010;

name		region						gdp_percapita
Afghanistan	Southern and Central Asia	539.667
Angola		Central Africa				3599.27
Albania		Southern Europe				4098.13
...
[Showing 100 out of 184 rows]

Modify your code to calculate the average GDP per capita AS avg_gdp for each region in 2010.

Select the region and avg_gdp fields.

SELECT region, avg(gdp_percapita) AS avg_gdp
FROM countries AS c
  LEFT JOIN economies AS e
    ON c.code = e.code
WHERE year = 2010
GROUP BY region;

region						avg_gdp
Southern Africa				5051.59797363281
Australia and New Zealand	44792.384765625
Southeast Asia				10547.1541320801
...
[Showing 23 out of 23 rows]

Arrange this data on average GDP per capita for each region in 2010 from highest to lowest average GDP per capita.

SELECT region, AVG(gdp_percapita) AS avg_gdp
FROM countries AS c
  LEFT JOIN economies AS e
    ON c.code = e.code
WHERE year = 2010
GROUP BY region
ORDER BY avg_gdp DESC;

region				avg_gdp
Western Europe		58130.9614955357
Nordic Countries	57073.99765625
North America		47911.509765625
...

</> Right join

The left join code is commented out here. Your task is to write a new query using rights joins that produces the same result as what the query using left joins produces. Keep this left joins code commented as you write your own query just below it using right joins to solve the problem.

Note the order of the joins matters in your conversion to using right joins!

SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
       indep_year, languages.name AS language, percent
FROM languages
  RIGHT JOIN countries
    ON languages.code = countries.code
  RIGHT JOIN cities
    ON countries.code = cities.country_code
ORDER BY city, language;

city		urbanarea_pop	country					indep_year	language	percent
Abidjan		4765000			Cote d'Ivoire			1960		French		null
Abidjan		4765000			Cote d'Ivoire			1960		Other		null
Abu Dhabi	1145000			United Arab Emirates	1971		Arabic		null
...
[Showing 100 out of 1375 rows]

</> Full join

Choose records in which region corresponds to North America or is NULL.

SELECT name AS country, code, region, basic_unit
FROM countries
  FULL JOIN currencies
    USING (code)
WHERE region = 'North America' OR region IS NULL
ORDER BY region;

country			code	region			basic_unit
Bermuda			BMU		North America	Bermudian dollar
United States	USA		North America	United States dollar
Canada			CAN		North America	Canadian dollar
...
[Showing 18 out of 18 rows]

Repeat the same query as above but use a LEFT JOIN instead of a FULL JOIN. Note what has changed compared to the FULL JOIN result!

SELECT name AS country, code, region, basic_unit
FROM countries
  LEFT JOIN currencies
    USING (code)
WHERE region = 'North America' OR region IS NULL
ORDER BY region;

country			code	region			basic_unit
Bermuda			BMU		North America	Bermudian dollar
Canada			CAN		North America	Canadian dollar
United States	USA		North America	United States dollar
Greenland		GRL		North America	null

Repeat the same query as above but use an INNER JOIN instead of a FULL JOIN. Note what has changed compared to the FULL JOIN and LEFT JOIN results!

SELECT name AS country, code, region, basic_unit
FROM countries
  INNER JOIN currencies
    USING (code)
WHERE region = 'North America' OR region IS NULL
ORDER BY region;

country			code	region			basic_unit
Bermuda			BMU		North America	Bermudian dollar
Canada			CAN		North America	Canadian dollar
United States	USA		North America	United States dollar

Choose records in which countries.name starts with the capital letter ‘V’ or is NULL.

Arrange by countries.name in ascending order to more clearly see the results.

SELECT countries.name, code, languages.name AS language
FROM languages
  FULL JOIN countries
    USING (code)
WHERE countries.name LIKE 'V%' OR countries.name IS NULL
ORDER BY countries.name;

name		code	language
Vanuatu		VUT		Tribal Languages
Vanuatu		VUT		English
Vanuatu		VUT		French
...
[Showing 58 out of 58 rows]

Repeat the same query as above but use a left join instead of a full join. Note what has changed compared to the full join result!

SELECT countries.name, code, languages.name AS language
FROM languages
  LEFT JOIN countries
    USING (code)
WHERE countries.name LIKE 'V%' OR countries.name IS NULL
ORDER BY countries.name;

name		code	language
Vanuatu		VUT		English
Vanuatu		VUT		Other
Vanuatu		VUT		French
...
[Showing 56 out of 56 rows]

Repeat once more, but use an inner join instead of a left join. Note what has changed compared to the full join and left join results.

SELECT countries.name, code, languages.name AS language
FROM languages
  INNER JOIN countries
    USING (code)
WHERE countries.name LIKE 'V%' OR countries.name IS NULL
ORDER BY countries.name;

name		code	language
Vanuatu		VUT		Tribal Languages
Vanuatu		VUT		Bislama
Vanuatu		VUT		English
...
[Showing 10 out of 10 rows]

Complete a full join with countries on the left and languages on the right.

Next, full join this result with currencies on the right.

Use LIKE to choose the Melanesia and Micronesia regions (Hint: ‘M%esia’).

Select the fields corresponding to the country name AS country, region, language name AS language, and basic and fractional units of currency.

SELECT c1.name AS country, region, l.name AS language,
       basic_unit, frac_unit
FROM countries AS c1
  FULL JOIN languages AS l
    USING (code)
  FULL JOIN currencies AS c2
    USING (code)
WHERE region LIKE 'M%esia';

country				region		language	basic_unit				frac_unit
Kiribati			Micronesia	English		Australian dollar		Cent
Kiribati			Micronesia	Kiribati	Australian dollar		Cent
Marshall Islands	Micronesia	Other		United States dollar	Cent
...

</> Review outer joins

A(n) ___ join is a join combining the results of a ___ join and a ___ join.

  • left, full, right
  • right, full, left
  • inner, left, right
  • None of the above are true

</> A table of two cities

Create the cross join as described above. (Recall that cross joins do not use ON or USING.)

Make use of LIKE and Hyder% to choose Hyderabad in both countries.

Select only the city name AS city and language name AS language.

SELECT c.name AS city, l.name AS language
FROM cities AS c        
  CROSS JOIN languages AS l
WHERE c.name LIKE 'Hyder%';

city				language
Hyderabad (India)	Dari
Hyderabad			Dari
Hyderabad (India)	Pashto
...
[Showing 100 out of 1910 rows]

Use an inner join instead of a cross join. Think about what the difference will be in the results for this inner join result and the one for the cross join.

SELECT c.name AS city, l.name AS language
FROM cities AS c
  INNER JOIN languages AS l
    ON c.country_code = l.code
WHERE c.name LIKE 'Hyder%';

city				language
Hyderabad (India)	Hindi
Hyderabad (India)	Bengali
Hyderabad (India)	Telugu
...
[Showing 25 out of 25 rows]

</> Outer challenge

Select country name AS country, region, and life expectancy AS life_exp.

Make sure to use LEFT JOIN, WHERE, ORDER BY, and LIMIT.

SELECT c.name AS country, region, life_expectancy AS life_exp
FROM countries AS c
  LEFT JOIN populations AS p
    ON c.code = p.country_code
WHERE year = 2010
ORDER BY life_exp
LIMIT 5;

country						region				life_exp
Lesotho						Southern Africa		47.4834
Central African Republic	Central Africa		47.6253
Sierra Leone				Western Africa		48.229
Swaziland					Southern Africa		48.3458
Zimbabwe					Eastern Africa		49.5747

3. Set theory clauses

</> Union

Combine these two tables into one table containing all of the fields in economies2010. The economies table is also included for reference.

Sort this resulting single table by country code and then by year, both in ascending order.

SELECT *
  FROM economies2010
	UNION
SELECT *
  FROM economies2015
ORDER BY code, year;

code	year	income_group		gross_savings
AFG		2010	Low income			37.133
AFG		2015	Low income			21.466
AGO		2010	Upper middle income	23.534
...
[Showing 100 out of 380 rows]

Determine all (non-duplicated) country codes in either the cities or the currencies table. The result should be a table with only one field called country_code.

Sort by country_code in alphabetical order.

SELECT country_code
  FROM cities
	UNION
SELECT code
  FROM currencies
ORDER BY country_code;

country_code
ABW
AFG
AGO
...

</> Union all

Determine all combinations (include duplicates) of country code and year that exist in either the economies or the populations tables. Order by code then year.

The result of the query should only have two columns/fields. Think about how many records this query should result in.

You’ll use code very similar to this in your next exercise after the video. Make note of this code after completing it.

SELECT code, year
  FROM economies
	UNION ALL
SELECT country_code, year
  FROM populations
ORDER BY code, year;

code	year
ABW		2010
ABW		2015
AFG		2010
...
[Showing 100 out of 814 rows]

</> Intersect

Again, order by code and then by year, both in ascending order.

Note the number of records here (given at the bottom of query result) compared to the similar UNION ALL query result (814 records).

SELECT code, year
  FROM economies
	INTERSECT
SELECT country_code, year
  FROM populations
ORDER BY code, year;

code	year
AFG		2010
AFG		2015
AGO		2010
...
[Showing 100 out of 380 rows]

Use INTERSECT to answer this question with countries and cities!

SELECT name
  FROM countries
	INTERSECT
SELECT name
  FROM cities;

name
Singapore

</> Review union and intersect

Which of the following combinations of terms and definitions is correct?

  • UNION: returns all records (potentially duplicates) in both tables
  • UNION ALL: returns only unique records
  • INTERSECT: returns only records appearing in both tables
  • None of the above are matched correctly

</> Except

Order the resulting field in ascending order.

Can you spot the city/cities that are actually capital cities which this query misses?

SELECT name
  FROM cities
	EXCEPT
SELECT capital
  FROM countries
ORDER BY name;

name
Abidjan
Ahmedabad
Alexandria
...
[Showing 100 out of 170 rows]

Order by capital in ascending order.

The cities table contains information about 236 of the world’s most populous cities. The result of your query may surprise you in terms of the number of capital cities that DO NOT appear in this list!

SELECT capital
  FROM countries
	EXCEPT
SELECT name
  FROM cities
ORDER BY capital;

capital
Agana
Amman
Amsterdam
...
[Showing 100 out of 136 rows]

</> Semi-join

Flash back to our Intro to SQL for Data Science course and begin by selecting all country codes in the Middle East as a single field result using SELECT, FROM, and WHERE.

SELECT code
  FROM countries
WHERE region = 'Middle East';

code
ARE
ARM
AZE
...
[Showing 18 out of 18 rows]

Below the commented code, select only unique languages by name appearing in the languages table.

Order the resulting single field table by name in ascending order.

SELECT DISTINCT name
  FROM languages
ORDER BY name;

name
Afar
Afrikaans
Akyem
...
[Showing 100 out of 396 rows]

Now combine the previous two queries into one query:

  • Add a WHERE IN statement to the SELECT DISTINCT query, and use the commented out query from the first instruction in there. That way, you can determine the unique languages spoken in the Middle East.

Carefully review this result and its code after completing it. It serves as a great example of subqueries, which are the focus of Chapter 4.

SELECT DISTINCT name
  FROM languages
WHERE code IN
  (SELECT code
   FROM countries
   WHERE region = 'Middle East')
ORDER BY name;

name
Arabic
Aramaic
Armenian
...
[Showing 27 out of 27 rows]

</> Relating semi-join to a tweaked inner join

Let’s revisit the code from the previous exercise, which retrieves languages spoken in the Middle East.

SELECT DISTINCT name
FROM languages
WHERE code IN
  (SELECT code
   FROM countries
   WHERE region = 'Middle East')
ORDER BY name;

Sometimes problems solved with semi-joins can also be solved using an inner join.

SELECT languages.name AS language
FROM languages
INNER JOIN countries
ON languages.code = countries.code
WHERE region = 'Middle East'
ORDER BY language;

This inner join isn’t quite right. What is missing from this second code block to get it to match with the correct answer produced by the first block?

  • HAVING instead of WHERE
  • DISTINCT
  • UNIQUE

</> Diagnosing problems using anti-join

Begin by determining the number of countries in countries that are listed in Oceania using SELECT, FROM, and WHERE.

SELECT COUNT(name)
  FROM countries
WHERE continent = 'Oceania';

count
19
  • Complete an inner join with countries AS c1 on the left and currencies AS c2 on the right to get the different currencies used in the countries of Oceania.
  • Match ON the code field in the two tables.
  • Include the country code, country name, and basic_unit AS currency.

Observe query result and make note of how many different countries are listed here.

SELECT c1.code, c1.name, basic_unit AS currency
  FROM countries AS c1
  	INNER JOIN currencies AS c2
    ON c1.code = c2.code
WHERE continent = 'Oceania';

code	name				currency
AUS		Australia			Australian dollar
PYF		French Polynesia	CFP franc
KIR		Kiribati			Australian dollar
...

Note that not all countries in Oceania were listed in the resulting inner join with currencies. Use an anti-join to determine which countries were not included!

  • Use NOT IN and (SELECT code FROM currencies) as a subquery to get the country code and country name for the Oceanian countries that are not included in the currencies table.
SELECT *
  FROM countries
  WHERE continent = 'Oceania'
  	AND code NOT in
  	(SELECT code
  	 FROM currencies);

code	name							continent	region		surface_area	indep_year	local_name					gov_form				capital		cap_long	cap_lat
ASM		American Samoa					Oceania		Polynesia	199				null		Amerika Samoa				US Territory			Pago Pago	-170.691	-14.2846
FJI		Fiji Islands					Oceania		Melanesia	18274			1970		Fiji Islands				Republic				Suva		178.399		-18.1149
GUM		Guam							Oceania		Micronesia	549				null		Guam						US Territory			Agana		144.794		13.4443
FSM		Micronesia, Federated States of	Oceania		Micronesia	702				1990		Micronesia					Federal Republic		Palikir		158.185		6.91771
MNP		Northern Mariana Islands		Oceania		Micronesia	464				null		Northern Mariana Islands	Commonwealth of the US	Saipan		145.765		15.1935

</> Set theory challenge

Identify the country codes that are included in either economies or currencies but not in populations.

Use that result to determine the names of cities in the countries that match the specification in the previous instruction.

SELECT name
  FROM cities AS c1
  WHERE country_code IN
(
    SELECT e.code
    FROM economies AS e
    UNION
    SELECT c2.code
    FROM currencies AS c2
    EXCEPT
    SELECT p.country_code
    FROM populations AS p
);

name
Bucharest
Kaohsiung
New Taipei City
Taichung
Tainan
Taipei

4. Subqueries

</> Subquery inside where

Begin by calculating the average life expectancy across all countries for 2015.

SELECT AVG(life_expectancy)
  FROM populations
WHERE year = 2015;

avg
71.6763415481105

Recall that you can use SQL to do calculations for you. Suppose we wanted only records that were above 1.15 * 100 in terms of life expectancy for 2015:

SELECT *
  FROM populations
WHERE life_expectancy > 1.15 * 100
  AND year = 2015;

Select all fields from populations with records corresponding to larger than 1.15 times the average you calculated in the first task for 2015. In other words, change the 100 in the example above with a subquery.

SELECT *
  FROM populations
WHERE life_expectancy > 1.15 * # 1.15 * 子查询
  (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015)
  AND year = 2015;

pop_id	country_code	year	fertility_rate	life_expectancy	size
21		AUS				2015	1.833			82.4512			23789800
376		CHE				2015	1.54			83.1976			8281430
356		ESP				2015	1.32			83.3805			46444000
...

Make use of the capital field in the countries table in your subquery.

Select the city name, country code, and urban area population fields.

SELECT name, country_code, urbanarea_pop
  FROM cities
WHERE name IN
  (SELECT capital
   FROM countries)
ORDER BY urbanarea_pop DESC;

name		country_code	urbanarea_pop
Beijing		CHN				21516000
Dhaka		BGD				14543100
Tokyo		JPN				13513700
...

</> Subquery inside select

Just Submit Answer here!

SELECT name AS country,
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;

country		cities_num
China		36
India		18
Japan		11
...

Remove the comments around the second query and comment out the first query instead.

Convert the GROUP BY code to use a subquery inside of SELECT, i.e. fill in the blanks to get a result that matches the one given using the GROUP BY code in the first query.

Again, sort the result by cities_num descending and then by country ascending.

SELECT name AS country,
  (SELECT COUNT(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;

country		cities_num
China		36
India		18
Japan		11
...

</> Subquery inside from

Begin by determining for each country code how many languages are listed in the languages table using SELECT, FROM, and GROUP BY.

Alias the aggregated field as lang_num.

SELECT code, COUNT(name) AS lang_num
  FROM languages
GROUP BY code;

code	lang_num
BLZ		9
BGD		2
ITA		4
...

Include the previous query (aliased as subquery) as a subquery in the FROM clause of a new query.

Select the local name of the country from countries.

Also, select lang_num from subquery.

Make sure to use WHERE appropriately to match code in countries and in subquery.

Sort by lang_num in descending order.

SELECT local_name, lang_num
  FROM countries
  CROSS	JOIN (SELECT code, COUNT(name) AS lang_num
  	FROM languages
  	 GROUP BY code) AS subquery
  WHERE countries.code = subquery.code
ORDER BY lang_num DESC;

local_name		lang_num
Zambia			19
Zimbabwe		16
YeItyop´iya		16
...

</> Advanced subquery

Now it’s time to append the second part’s query to the first part’s query using AND and IN to obtain the name of the country, its continent, and the maximum inflation rate for each continent in 2015!

For the sake of practice, change all joining conditions to use ON instead of USING (based upon the same column, code).

Revisit the sample output in the assignment text at the beginning of the exercise to see how this matches up.

SELECT name, continent, inflation_rate
  FROM countries
  	INNER JOIN economies
    USING(code)
WHERE year = 2015;

name			continent	inflation_rate
Afghanistan		Asia		-1.549
Angola			Africa		10.287
Albania			Europe		1.896
...

Select the maximum inflation rate in 2015 AS max_inf grouped by continent using the previous step’s query as a subquery in the FROM clause.

Thus, in your subquery you should:

  • Create an inner join with countries on the left and economies on the right with USING (without aliasing your tables or columns).
  • Retrieve the country name, continent, and inflation rate for 2015.
  • Alias the subquery as subquery.

This will result in the six maximum inflation rates in 2015 for the six continents as one field table. Make sure to not include continent in the outer SELECT statement.

SELECT MAX(inflation_rate) AS max_inf
  FROM (
      SELECT name, continent, inflation_rate
      FROM countries
      INNER JOIN economies
      USING(code)
      WHERE year = 2015) AS subquery
GROUP BY continent;

max_inf
48.684
9.784
39.403
...

Now it’s time to append your second query to your first query using AND and IN to obtain the name of the country, its continent, and the maximum inflation rate for each continent in 2015.

For the sake of practice, change all joining conditions to use ON instead of USING.

SELECT name, continent, inflation_rate
  FROM countries
	INNER JOIN economies
	ON countries.code = economies.code
  WHERE year = 2015
    AND
        inflation_rate IN (
        SELECT MAX(inflation_rate) AS max_inf FROM(
             SELECT name, continent, inflation_rate
             FROM countries
             INNER JOIN economies
             ON countries.code = economies.code
             WHERE year = 2015) AS subquery
        GROUP BY continent);

name		continent		inflation_rate
Haiti		North America	7.524
Malawi		Africa			21.858
Nauru		Oceania			9.784
...

</> Subquery challenge

Select the country code, inflation rate, and unemployment rate.

Order by inflation rate ascending.

Do not use table aliasing in this exercise.

SELECT code, inflation_rate, unemployment_rate
  FROM economies
  WHERE year = 2015 AND code NOT in
  	(SELECT code
  	 FROM countries
  	 WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic%'))
ORDER BY inflation_rate;

code	inflation_rate	unemployment_rate
AFG		-1.549			null
CHE		-1.14			3.178
PRI		-0.751			12
...

</> Subquery review

Within which SQL clause are subqueries most frequently found?

  • WHERE
  • FROM
  • SELECT
  • IN

</> Final challenge

Select unique country names. Also select the total investment and imports fields.

Use a left join with countries on the left. (An inner join would also work, but please use a left join here.)

Match on code in the two tables AND use a subquery inside of ON to choose the appropriate languages records.

Order by country name ascending.

Use table aliasing but not field aliasing in this exercise.

SELECT DISTINCT name, total_investment, imports
  FROM countries AS c
    LEFT JOIN economies AS e
      ON (c.code = e.code
        AND c.code IN (
          SELECT l.code
          FROM languages AS l
          WHERE official = 'true'
        ) )
  WHERE region = 'Central America' AND year = 2015
ORDER BY name;

name	total_investment	imports
Belize	22.014	6.743
Costa Rica	20.218	4.629
El Salvador	13.983	8.193
...

Include the name of region, its continent, and average fertility rate aliased as avg_fert_rate.

Sort based on avg_fert_rate ascending.

Remember that you’ll need to GROUP BY all fields that aren’t included in the aggregate function of SELECT.

SELECT region, continent, AVG(fertility_rate) AS avg_fert_rate
  FROM countries AS c1
    INNER JOIN populations AS p
      ON c1.code = p.country_code
  WHERE year = 2015
GROUP BY region, continent
ORDER BY avg_fert_rate;

region				continent	avg_fert_rate
Southern Europe		Europe		1.42610000371933
Eastern Europe		Europe		1.49088890022702
Baltic Countries	Europe		1.60333331425985
...

Select the city name, country code, city proper population, and metro area population.

Calculate the percentage of metro area population composed of city proper population for each city in cities, aliased as city_perc.

Focus only on capital cities in Europe and the Americas in a subquery.

Make sure to exclude records with missing data on metro area population.

Order the result by city_perc descending.

Then determine the top 10 capital cities in Europe and the Americas in terms of this city_perc percentage.

SELECT name, country_code, city_proper_pop, metroarea_pop,  
      city_proper_pop / metroarea_pop * 100 AS city_perc
  FROM cities
  WHERE name IN
    (SELECT capital
     FROM countries
     WHERE (continent = 'Europe'
        OR continent LIKE '%America%'))
       AND metroarea_pop IS NOT NULL
ORDER BY city_perc desc
LIMIT 10;

name	country_code	city_proper_pop	metroarea_pop	city_perc
Lima	PER				8852000			10750000		82.3441863059998
Bogota	COL				7878780			9800000			80.3957462310791
Moscow	RUS				12197600		16170000		75.4334926605225
...
  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
Beginning Apache Pig: Big Data Processing Made Easy English | 29 Dec. 2016 | ISBN: 1484223365 | 300 Pages | PDF | 4.9 MB Learn to use Apache Pig to develop lightweight big data applications easily and quickly. This book shows you many optimization techniques and covers every context where Pig is used in big data analytics. Beginning Apache Pig shows you how Pig is easy to learn and requires relatively little time to develop big data applications. The book is divided into four parts: the complete features of Apache Pig; integration with other tools; how to solve complex business problems; and optimization of tools. You'll discover topics such as MapReduce and why it cannot meet every business need; the features of Pig Latin such as data types for each load, store, joins, groups, and ordering; how Pig workflows can be created; submitting Pig jobs using Hue; and working with Oozie. You'll also see how to extend the framework by writing UDFs and custom load, store, and filter functions. Finally you'll cover different optimization techniques such as gathering statistics about a Pig script, joining strategies, parallelism, and the role of data formats in good performance. What You Will Learn • Use all the features of Apache Pig • Integrate Apache Pig with other tools • Extend Apache Pig • Optimize Pig Latin code • Solve different use cases for Pig Latin Who This Book Is For All levels of IT professionals: architects, big data enthusiasts, engineers, developers, and big data administrators

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值