Course Description
Now that you've learned the basics of SQL in our Intro to SQL for Data Science course, it's time to supercharge your queries using joins and relational set theory! In this course you'll learn all about the power of joining tables while exploring interesting features of countries and their cities throughout the world. You will master inner and outer joins, as well as self-joins, semi-joins, anti-joins and cross joins - fundamental tools in any PostgreSQL wizard's toolbox. You'll fear set theory no more, after learning all about unions, intersections, and except clauses through easy-to-understand diagrams and examples. Lastly, you'll be introduced to the challenging topic of subqueries. You will see a visual perspective to grasp the ideas throughout the course using the mediums of Venn diagrams and other linking illustrations.
Imports
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table
from sqlalchemy import Column
from sqlalchemy import Integer, String
from sqlalchemy import inspect
import pandas as pd
from pprint import pprint as pp
Pandas Configuration Options
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
PostgreSQL Connection
engine
to use your database username and password.engine = create_engine('postgresql://postgres:postgres@localhost/postgres')
meta = MetaData(schema="countries")
conn = engine.connect()
Example(s) without pd.DataFrames - use fetchall
result = conn.execute("SELECT datname from pg_database")
rows = result.fetchall()
[x for x in rows]
cities = conn.execute("select * \
from countries.countries \
inner join countries.cities \
on countries.cities.country_code = countries.code")
cities_res = cities.fetchall()
cities_list = [x for i, x in enumerate(cities_res) if i < 10]
cities_list
In this chapter, you'll be introduced to the concept of joining tables, and explore the different ways you can enrich your queries using inner joins and self-joins. You'll also see how to use the case statement to split up a field into different categories.
cities = conn.execute("select * from countries.cities")
cities_df = pd.read_sql("select * from countries.cities", conn)
cities_df.head()
sql_stmt = "SELECT * FROM countries.cities INNER JOIN countries.countries ON countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
sql_stmt = "SELECT countries.cities.name as city, countries.countries.name as country, \
countries.countries.region FROM countries.cities INNER JOIN countries.countries ON \
countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
SELECT left_table.id as L_id
left_table.val as L_val
right_table.val as R_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
USING
clause instead of the ON
clause.SELECT left_table.id as L_id
left_table.val as L_val
right_table.val as R_val
FROM left_table
INNER JOIN right_table
USING (id);
SELECT p1.country, p1.continent, prime_minister, president
FROM leaders.presidents AS p1
INNER JOIN leaders.prime_ministers as p2
USING (country);
sql_stmt = "SELECT p1.country, p1.continent, prime_minister, president \
FROM leaders.presidents AS p1 \
INNER JOIN leaders.prime_ministers as p2 \
USING (country)"
pd.read_sql(sql_stmt, conn).head()
Why does the following code result in an error?
SELECT c.name AS country, l.name AS language
FROM countries AS c
INNER JOIN languages AS l;
INNER JOIN
requires a specification of the key field (or fields) in each table.When joining tables with a common field name, e.g.
SELECT *
FROM countries
INNER JOIN economies
ON countries.code = economies.code
You can use USING
as a shortcut:
SELECT *
FROM countries
INNER JOIN economies
USING(code)
You'll now explore how this can be done with the countries
and languages
tables.
Instructions
countries
on the left and languages
on the right with USING(code)
.AS country
,AS language
, and-- 4. Select fields
SELECT c.name as country, c.continent, l.name as language, l.official
-- 1. From countries (alias as c)
FROM countries as c
-- 2. Join to languages (as l)
INNER JOIN languages as l
-- 3. Match using code
USING (code)
sql_stmt = "SELECT c.name as country, c.continent, l.name as language, l.official \
FROM countries.countries as c \
INNER JOIN countries.languages as l \
USING (code)"
pd.read_sql(sql_stmt, conn).head()
sql_stmt = "SELECT * \
FROM leaders.prime_ministers"
pm_df = pd.read_sql(sql_stmt, conn)
pm_df.head()
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent;
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent"
pm_df_1 = pd.read_sql(sql_stmt, conn)
pm_df_1.head()
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent AND p1.country <> p2.country;
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent AND p1.country <> p2.country"
pm_df_2 = pd.read_sql(sql_stmt, conn)
pm_df_2.head()
pm_df_1.equals(pm_df_2)
AND
clause can check that multiple conditions are met.SELECT name, continent, indep_year,
CASE WHEN indep_year < 1900 THEN 'before 1900'
WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
ELSE 'after 1930' END
AS indep_year_group
FROM states
ORDER BY indep_year_group;
sql_stmt = "SELECT name, continent, indep_year, \
CASE WHEN indep_year < 1900 THEN 'before 1900' \
WHEN indep_year <= 1930 THEN 'between 1900 and 1930' \
ELSE 'after 1930' END \
AS indep_year_group \
FROM leaders.states \
ORDER BY indep_year_group"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll use the populations
table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!
Since you'll be joining the populations
table to itself, you can alias populations
as p1
and also populations
as p2
. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.
Instructions 1/3
populations
with itself ON country_code
.country_code
from p1
and the size
field from both p1
and p2
. SQL won't allow same-named fields, so alias p1.size as size2010
and p2.size as size2015
.-- 4. Select fields with aliases
SELECT p1.size as size2010,
p1.country_code,
p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
-- 2. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 3. Match on country code
ON p1.country_code = p2.country_code
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code"
pd.read_sql(sql_stmt, conn).head()
Instructions 2/3
Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.
ON
in your query to include only those records where the p1.year
(2010) matches with p2.year - 5
(2015 - 5 = 2010). This will omit the three entries per country_code
that you aren't interested in.-- 4. Select fields with aliases
SELECT p1.country_code,
p1.size as size2010,
p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
-- 2. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 3. Match on country code
ON p1.country_code = p2.country_code
-- 4. and year (with calculation)
AND p1.year = (p2.year - 5)
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
Instructions 3/3
As you just saw, you can also use SQL to calculate values like p2.year - 5
for you. With two fields like size2010
and size2015
, you may want to determine the percentage increase from one field to the next:
With two numeric fields A
and B
, the percentage growth from A
to B
can be calculated as $$\frac{(B−A)}{A}∗100.0$$.
Add a new field to SELECT
, aliased as growth_perc
, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size
and p1.size
.
SELECT p1.country_code,
p1.size AS size2010,
p2.size AS size2015,
-- 1. calculate growth_perc
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM countries.populations as p1
-- 3. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 4. Match on country code
ON p1.country_code = p2.country_code
-- 5. and year (with calculation)
AND p1.year = (p2.year - 5);
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015, \
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.
You can use CASE
with WHEN
, THEN
, ELSE
, and END
to define a new grouping field.
Instructions
Using the countries table, create a new field AS geosize_group that groups the countries into three groups:
surface_area
is greater than 2 million, geosize_group
is 'large'
.surface_area
is greater than 350 thousand but not larger than 2 million, geosize_group
is 'medium'
.geosize_group
is 'small'
.SELECT name, continent, code, surface_area,
-- 1. First case
CASE WHEN surface_area > 2000000 THEN 'large'
-- 2. Second case
WHEN surface_area > 350000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name
AS geosize_group
-- 5. From table
FROM countries.countries;
sql_stmt = "SELECT name, continent, code, surface_area, \
CASE WHEN surface_area > 2000000 THEN 'large' \
WHEN surface_area > 350000 THEN 'medium' \
ELSE 'small' END \
AS geosize_group \
FROM countries.countries;"
pd.read_sql(sql_stmt, conn).head()
The table you created with the added geosize_group
field has been loaded for you here with the name countries_plus
. Observe the use of (and the placement of) the INTO
command to create this countries_plus
table:
If you have downloaded the data from DataCamp and already have a schema for countries, countries_plus is already one of the tables
SELECT name, continent, code, surface_area,
CASE WHEN surface_area > 2000000
THEN 'large'
WHEN surface_area > 350000
THEN 'medium'
ELSE 'small' END
AS geosize_group
INTO countries_plus
FROM countries.countries;
You will now explore the relationship between the size of a country in terms of surface area and in terms of population using grouping fields created with CASE
.
By the end of this exercise, you'll be writing two queries back-to-back in a single script. You got this!
Instructions 1/3
Using the populations
table focused only for the year
2015, create a new field AS popsize_group
to organize population size
into
'large'
(> 50 million),'medium'
(> 1 million), and'small'
groups.Select only the country code, population size, and this new popsize_group
as fields.
SELECT country_code, size,
-- 1. First case
CASE WHEN size > 50000000 THEN 'large'
-- 2. Second case
WHEN size > 1000000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name
AS popsize_group
-- 5. From table
FROM countries.populations
-- 6. Focus on 2015
WHERE year = 2015;
sql_stmt = "\
SELECT country_code, size, \
CASE WHEN size > 50000000 THEN 'large' \
WHEN size > 1000000 THEN 'medium' \
ELSE 'small' END \
AS popsize_group \
FROM countries.populations \
WHERE year = 2015; \
"
pd.read_sql(sql_stmt, conn).head()
Instructions 2/3
INTO
to save the result of the previous query as pop_plus
. You can see an example of this in the countries_plus
code in the assignment text. Make sure to include a ;
at the end of your WHERE
clause!pop_plus
using SELECT * FROM pop_plus
; so that you generate results and this will display pop_plus
in query result.Execute the first part on the PostgreSQL schema to create pop_plus
SELECT country_code, size,
CASE WHEN size > 50000000 THEN 'large'
WHEN size > 1000000 THEN 'medium'
ELSE 'small' END
AS popsize_group
-- 1. Into table
INTO countries.pop_plus
FROM populations
WHERE year = 2015;
Run this below
-- 2. Select all columns of pop_plus
SELECT * FROM countries.pop_plus;
sql_stmt = "\
SELECT * FROM countries.pop_plus; \
"
pd.read_sql(sql_stmt, conn).head()
Instructions 3/3
pop_plus
using INTO
.countries_plus AS c
on the left with pop_plus AS p
on the right matching on the country code fields.geosize_group
, in ascending order so that large
appears on top.name
, continent
, geosize_group
, and popsize_group
fields.sql_stmt = "\
SELECT c.name, c.continent, c.geosize_group, p.popsize_group \
FROM countries.countries_plus AS c \
INNER JOIN countries.pop_plus AS p \
ON c.code = p.country_code \
ORDER BY geosize_group ASC \
"
q_df = pd.read_sql(sql_stmt, conn)
q_df.head()
q_df.tail()
In this chapter, you'll come to grips with different kinds of outer joins. You'll learn how to gain further insights into your data through left joins, right joins, and full joins. In addition to outer joins, you'll also work with cross joins.
INNER JOIN
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
INNER JOIN presidents p2
ON p1.country = p2.country;
LEFT JOIN
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
LEFT JOIN presidents p2
ON p1.country = p2.country;
LEFT JOIN multiple matches
RIGHT JOIN
SELECT right_table.id AS R_id,
left_table.val AS L_val,
right_talbe.vale AS R_val
FROM left_table
RIGHT JOIN right_table
ON left_table.id = right_table.id;
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"
pd.read_sql(sql_stmt, conn)
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
LEFT JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"
pd.read_sql(sql_stmt, conn)
Now you'll explore the differences between performing an inner join and a left join using the cities
and countries
tables.
You'll begin by performing an inner join with the cities
table on the left and the countries
table on the right. Remember to alias the name of the city field as city
and the name of the country field as country
.
You will then change the query to a left join. Take note of how many records are in each query here!
Instructions 1/2
-- Select the city name (with alias), the country code,
-- the country name (with alias), the region,
-- and the city proper population
SELECT c1.name AS city, code, c2.name AS country,
region, city_proper_pop
-- From left table (with alias)
FROM cities AS c1
-- Join to right table (with alias)
INNER JOIN countries AS c2
-- Match on country code
ON c1.country_code = c2.code
-- Order by descending country code
ORDER BY code DESC;
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
INNER JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"
pd.read_sql(sql_stmt, conn).head()
Instructions 2/2
LEFT JOIN
instead of an INNER JOIN
. After executing this query, note how many records the query result contains.SELECT c1.name AS city, code, c2.name AS country,
region, city_proper_pop
FROM cities AS c1
-- 1. Join right table (with alias)
LEFT JOIN countries AS c2
-- 2. Match on country code
ON c1.country_code = c2.code
-- 3. Order by descending country code
ORDER BY code DESC;
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
LEFT JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"
pd.read_sql(sql_stmt, conn).head()
Next, you'll try out another example comparing an inner join to its corresponding left join. Before you begin though, take note of how many records are in both the countries
and languages
tables below.
You will begin with an inner join on the countries
table on the left with the languages
table on the right. Then you'll change the code to a left join in the next bullet.
Note the use of multi-line comments here using /*
and */
.
Instructions 1/2
country
field as country and the name of the language
field as language.sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
INNER JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
Instructions 2/2
sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
LEFT JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
Left join (3)
You'll now revisit the use of the AVG()
function introduced in our Intro to SQL for Data Science course. You will use it in combination with left join to determine the average gross domestic product (GDP) per capita by region in 2010.
Instructions 1/3
countries
table on the left and the economies
table on the right.year
.sql_stmt = "\
SELECT name, region, gdp_percapita \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010; \
"
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
Instructions 2/3
AS avg_gdp
for each region in 2010.region
and avg_gdp
fields.sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region \
ORDER BY avg_gdp DESC; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
Instructions 3/3
sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region; \
"
res3 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res3)}')
res3.head()
Right joins aren't as common as left joins. One reason why is that you can always write a right join as a left join.
Instructions
The left join code is commented out here. Your task is to write a new query using rights joins that produces the same result as what the query using left joins produces. Keep this left joins code commented as you write your own query just below it using right joins to solve the problem.
Note the order of the joins matters in your conversion to using right joins!
SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
indep_year, languages.name AS language, percent
FROM cities
LEFT JOIN countries
ON cities.country_code = countries.code
LEFT JOIN languages
ON countries.code = languages.code
ORDER BY city, language;
sql_stmt = "\
SELECT cities.name AS city, urbanarea_pop, countries.name AS country, \
indep_year, languages.name AS language, percent \
FROM countries.languages \
RIGHT JOIN countries.countries \
ON languages.code = countries.code \
RIGHT JOIN countries.cities \
ON cities.country_code = countries.code \
ORDER BY city, language; \
"
pd.read_sql(sql_stmt, conn).head()
SELECT left_table.id AS L_id,
right_table.id AS R_id,
left_table.val AS L_val,
right_table.val as R_val,
FULL left_table
FULL JOIN right_table
USING (id);
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president
FROM prime_ministers AS p1
FULL JOIN presidents AS p2
ON p1.country = p2.country;
sql_stmt = "\
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president \
FROM leaders.prime_ministers AS p1 \
FULL JOIN leaders.presidents AS p2 \
ON p1.country = p2.country; \
"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll examine how your results differ when using a full join versus using a left join versus using an inner join with the countries
and currencies
tables.
You will focus on the North American region
and also where the name
of the country is missing. Dig in to see what we mean!
Begin with a full join with countries
on the left and currencies
on the right. The fields of interest have been SELECT
ed for you throughout this exercise.
Then complete a similar left join and conclude with an inner join.
Instructions 1/3
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
FULL JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
FULL JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2/3
LEFT JOIN
instead of a FULL JOIN
. Note what has changed compared to the FULL JOIN
result!SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
LEFT JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
LEFT JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
Instruction 3/3
INNER JOIN
instead of a FULL JOIN
. Note what has changed compared to the FULL JOIN
and LEFT JOIN
results!SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
INNER JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
INNER JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
Have you kept an eye out on the different numbers of records these queries returned? The FULL JOIN
query returned 17 rows, the LEFT JOIN
returned 4 rows, and the INNER JOIN
only returned 3 rows. Do these results make sense to you?
You'll now investigate a similar exercise to the last one, but this time focused on using a table with more records on the left than the right. You'll work with the languages
and countries
tables.
Begin with a full join with languages
on the left and countries
on the right. Appropriate fields have been selected for you again here.
Instructions 1/3
countries.name
starts with the capital letter 'V'
or is NULL
and arrange by countries.name
in ascending order to more clearly see the results.SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
FULL JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
FULL JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2/3
left join
instead of a full join. Note what has changed compared to the full join result!SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
LEFT JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
LEFT JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
Instructions 3/3
inner join
instead of a left join. Note what has changed compared to the full join and left join results.SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
INNER JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
INNER JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
You'll now explore using two consecutive full joins on the three tables you worked with in the previous two exercises.
Instructions
countries
on the left and languages
on the right.currencies
on the right.LIKE
to choose the Melanesia and Micronesia regions (Hint: 'M%esia'
).AS country
, region, language name AS language
, and basic and fractional units of currency.-- 7. Select fields (with aliases)
SELECT c1.name AS country, region, l.name AS language,
basic_unit, frac_unit
-- 1. From countries (alias as c1)
FROM countries AS c1
-- 2. Join with languages (alias as l)
FULL JOIN languages AS l
-- 3. Match on code
USING (code)
-- 4. Join with currencies (alias as c2)
FULL JOIN currencies AS c2
-- 5. Match on code
USING (code)
-- 6. Where region like Melanesia and Micronesia
WHERE region LIKE 'M%esia';
sql_stmt = "\
SELECT c1.name AS country, region, l.name AS language, \
basic_unit, frac_unit \
FROM countries.countries AS c1 \
FULL JOIN countries.languages AS l \
USING (code) \
FULL JOIN countries.currencies AS c2 \
USING (code) \
WHERE region LIKE 'M%%esia'; \
"
pd.read_sql(sql_stmt, conn)
A(n) ___
join is a join combining the results of a ___
join and a ___
join.
Answer the question
id
from table1
and id
from table2
(e.g. 1(A-C), 2(A-C), & 3(A-C))prime_ministers
table are scheduled for individual meetings with all presidents in the presidents table.SELECT prime_minister, president
FROM prime_ministers AS p1
CROSS JOIN presidents AS p2
WHERE p1.continent IN ('North America', 'Oceania');
sql_stmt = "\
SELECT prime_minister, president \
FROM leaders.prime_ministers AS p1 \
CROSS JOIN leaders.presidents AS p2 \
WHERE p1.continent IN ('North America', 'Oceania'); \
"
pd.read_sql(sql_stmt, conn)
This exercise looks to explore languages potentially and most frequently spoken in the cities of Hyderabad, India and Hyderabad, Pakistan.
You will begin with a cross join with cities AS c
on the left and languages AS l
on the right. Then you will modify the query using an inner join in the next tab.
Instructions 1/2
ON
or USING
.)LIKE
and Hyder%
to choose Hyderabad in both countries.AS city
and language name AS language
.-- 4. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities AS c
-- 2. Join to languages (alias as l)
CROSS JOIN languages AS l
-- 3. Where c.name like Hyderabad
WHERE c.name LIKE 'Hyder%';
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
CROSS JOIN countries.languages AS l \
WHERE c.name LIKE 'Hyder%%'; \
"
hyderabad_lang = pd.read_sql(sql_stmt, conn)
hyderabad_lang
unique_lang = hyderabad_lang['language'].unique()
print(len(unique_lang))
Instructions 2/2
-- 5. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities as c
-- 2. Join to languages (alias as l)
INNER JOIN languages AS l
-- 3. Match on country code
ON c.country_code = l.code
-- 4. Where c.name like Hyderabad
WHERE c.name like 'Hyder%';
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
INNER JOIN countries.languages AS l \
ON c.country_code = l.code \
WHERE c.name LIKE 'Hyder%%'; \
"
pd.read_sql(sql_stmt, conn)
Now that you're fully equipped to use outer joins, try a challenge problem to test your knowledge!
In terms of life expectancy for 2010, determine the names of the lowest five countries and their regions.
Instructions
AS country
, region
, and life expectancy AS life_exp
.LEFT JOIN
, WHERE
, ORDER BY
, and LIMIT
.-- Select fields
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp
-- From countries (alias as c)
FROM countries AS c
-- Join to populations (alias as p)
LEFT JOIN populations AS p
-- Match on country code
ON c.code = p.country_code
-- Focus on 2010
WHERE p.year = 2010
-- Order by life_exp
ORDER BY life_exp
-- Limit to 5 records
LIMIT 5;
sql_stmt = "\
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp \
FROM countries.countries AS c \
LEFT JOIN countries.populations AS p \
ON c.code = p.country_code \
WHERE p.year = 2010 \
ORDER BY life_exp \
LIMIT 5; \
"
pd.read_sql(sql_stmt, conn)
In this chapter, you'll learn more about set theory using Venn diagrams and you will be introduced to union, union all, intersect, and except clauses. You'll finish by investigating semi-joins and anti-joins, which provide a nice introduction to subqueries.
UNION
includes every record in both tables, but DOES NOT double count those that are in both tables.UNION ALL
includes every record in both tables and DOES replicate those that are in both tables, represented by the black centerINTERSECT
results in only those records found in both of the tables.EXCEPT
results in only those records in one table, BUT NOT the other.left_one
and right_one
, one
corresponds to each table having one field.UNION
on these two fields, the result is each record appearing in either table, but notice the id values of 1
and 4
in right_one
, are not included again in the UNION
since they were already found in the left_one
table.UNION ALL
includes all duplicates in its result, resulting in 8 total records for the example.monarchs
table in the leaders
databaseUNION
on the prime_ministers
and monarchs
tablesSELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;
prime_minister
field has been aliased as leader. The resulting field from the UNION
will have the name leader.SELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION ALL
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;
UNION
and UNION ALL
clauses do not do the lookup step that JOIN
s do, they stack records on top of each other from one table to the next.sql_stmt = "\
SELECT * \
FROM leaders.monarchs; \
"
pd.read_sql(sql_stmt, conn)
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"
pd.read_sql(sql_stmt, conn)
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION ALL \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"
pd.read_sql(sql_stmt, conn)
Near query result to the right, you will see two new tables with names economies2010
and economies2015
.
Instructions
economies2010
. The economies
table is also included for reference.-- Select fields from 2010 table
SELECT *
-- From 2010 table
FROM countries.economies2010
-- Set theory clause
UNION
-- Select fields from 2015 table
SELECT *
-- From 2015 table
FROM countries.economies2015
-- Order by code and year
ORDER BY code, year;
sql_stmt = "\
SELECT * \
FROM countries.economies2010 \
UNION \
SELECT * \
FROM countries.economies2015 \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
UNION
can also be used to determine all occurrences of a field across multiple tables. Try out this exercise with no starter code.
Instructions
cities
or the currencies
table. The result should be a table with only one field called country_code
.country_code
in alphabetical order.-- Select field
SELECT country_code
-- From cities
FROM countries.cities
-- Set theory clause
UNION
-- Select field
SELECT code
-- From currencies
FROM countries.currencies
-- Order by country_code
ORDER BY country_code;
sql_stmt = "\
SELECT country_code \
FROM countries.cities \
UNION \
SELECT code \
FROM countries.currencies \
ORDER BY country_code; \
"
country_codes = pd.read_sql(sql_stmt, conn)
country_codes.head()
country_codes.tail()
As you saw, duplicates were removed from the previous two exercises by using UNION
.
To include duplicates, you can use UNION ALL
.
Instructions
economies
or the populations
tables. Order by code
then year
.-- Select fields
SELECT code, year
-- From economies
FROM countries.economies
-- Set theory clause
UNION ALL
-- Select fields
SELECT country_code, year
-- From populations
FROM countries.populations
-- Order by code, year
ORDER BY code, year;
sql_stmt = "\
SELECT code, year \
FROM countries.economies \
UNION ALL \
SELECT country_code, year \
FROM countries.populations \
ORDER BY code, year; \
"
country_codes_year = pd.read_sql(sql_stmt, conn)
country_codes_year.head()
country_codes_year.tail()
INTERSECT
works in a similar fashion to UNION
and UNION ALL
, but remember from the Venn diagram, INTERSECT
only includes those records in common to both tables and fields selected.SELECT id
FROM left_one
INTERSECT
SELECT id
FROM right_one;
SELECT country
FROM leaders.prime_ministers
INTERSECT
SELECT country
FROM leaders.presidents
SELECT country, prime_minister as leader
FROM leaders.prime_ministers
INTERSECT
SELECT country, president
FROM leaders.presidents
INTERSECT
looks at two columns, it includes both columns in the search.INTERSECT
looks for records in common, not individual key fields like what a join does to match.sql_stmt = "\
SELECT country \
FROM leaders.prime_ministers \
INTERSECT \
SELECT country \
FROM leaders.presidents \
"
pd.read_sql(sql_stmt, conn)
Repeat the previous UNION ALL
exercise, this time looking at the records in common for country code and year for the economies
and populations
tables.
Instructions
code
and then by year
, both in ascending order.UNION ALL
query result (814 records).-- Select fields
SELECT code, year
-- From economies
FROM countries.economies
-- Set theory clause
INTERSECT
-- Select fields
SELECT country_code, year
-- From populations
FROM countries.populations
-- Order by code and year
ORDER BY code, year;
sql_stmt = "\
SELECT code, year \
FROM countries.economies \
INTERSECT \
SELECT country_code, year \
FROM countries.populations \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
As you think about major world cities and their corresponding country, you may ask which countries also have a city with the same name as their country name?
Instructions
INTERSECT
to answer this question with countries
and cities
!-- Select fields
SELECT name
-- From countries
FROM countries.countries
-- Set theory clause
INTERSECT
-- Select fields
SELECT name
-- From cities
FROM countries.cities;
sql_stmt = "\
SELECT name \
FROM countries.countries \
INTERSECT \
SELECT name \
FROM countries.cities; \
"
pd.read_sql(sql_stmt, conn)
Hong Kong is part of China, but it appears separately here because it has its own ISO country code. Depending upon your analysis, treating Hong Kong separately could be useful or a mistake. Always check your dataset closely before you perform an analysis!
Which of the following combinations of terms and definitions is correct?
Answer the question
EXCEPT
includes only the records in one table, but not in the other.EXCEPT
clause.SELECT monarch, country
FROM leaders.monarchs
EXCEPT
SELECT prime_minister, country
FROM leaders.prime_ministers;
sql_stmt = "\
SELECT monarch, country \
FROM leaders.monarchs \
EXCEPT \
SELECT prime_minister, country \
FROM leaders.prime_ministers; \
"
pd.read_sql(sql_stmt, conn)
Get the names of cities in cities
which are not noted as capital cities in countries
as a single field result.
Note that there are some countries in the world that are not included in the countries
table, which will result in some cities not being labeled as capital cities when in fact they are.
Instructions
-- Select field
SELECT name
-- From cities
FROM countries.cities
-- Set theory clause
EXCEPT
-- Select field
SELECT capital
-- From countries
FROM countries.countries
-- Order by result
ORDER BY name;
sql_stmt = "\
SELECT name \
FROM countries.cities \
EXCEPT \
SELECT capital \
FROM countries.countries \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn).head()
Now you will complete the previous query in reverse!
Determine the names of capital cities that are not listed in the cities
table.
Instructions
capital
in ascending order.cities
table contains information about 236 of the world's most populous cities. The result of your query may surprise you in terms of the number of capital cities that DO NOT appear in this list!-- Select field
SELECT capital
-- From countries
FROM countries.countries
-- Set theory clause
EXCEPT
-- Select field
SELECT name
-- From cities
FROM countries.cities
-- Order by ascending capital
ORDER BY capital;
sql_stmt = "\
SELECT capital \
FROM countries.countries \
EXCEPT \
SELECT name \
FROM countries.cities \
ORDER BY capital; \
"
pd.read_sql(sql_stmt, conn).head()
semi-joins
and anti-joins
don't have the same built-in SQL syntax that INNER JOIN and LEFT JOIN have.semi-joins
and anti-joins
are useful tools in filtering table records on the records of another table.SELECT name
FROM leaders.states
WHERE indep_year < 1800;
country
0 Portugal
1 Spain
SELECT president, country, continent
FROM leaders.presidents
president country continent
0 Abdel Fattah el-Sisi Egypt Africa
1 Marcelo Rebelo de Sousa Portugal Europe
2 Jovenel Moise Haiti North America
3 Jose Mujica Uruguay South America
4 Ellen Johnson Sirleaf Liberia Africa
5 Michelle Bachelet Chile South America
6 Tran Dai Quang Vietnam Asia
SEMI JOIN
¶SELECT president, country, continent
FROM leaders.presidents
WHERE country IN
(SELECT name
FROM leaders.states
WHERE indep_year < 1800);
semi-join
chooses records in the first table where a condition IS met in the second table.semi-join
matches records by key field in the right table with those in the left.sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE country IN \
(SELECT name \
FROM leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
ANTI JOIN
¶anti-join
chooses records in the first table where a condition IS NOT met in the second table.NOT
to exclude those countries in the subquery.SELECT president, country, continent
FROM leaders.presidents
WHERE continent LIKE '%America'
AND country NOT IN
(SELECT name
FROM leaders.states
WHERE indep_year < 1800);
anti-join
picks out those columns in the left table that do not match the condition on the right table.sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE continent LIKE '%%America' \
AND country NOT IN \
(SELECT name \
FROM leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 1/3
SELECT
, FROM
, and WHERE
.-- Select code
SELECT code
-- From countries
FROM countries
-- Where region is Middle East
WHERE region = 'Middle East';
sql_stmt = "\
SELECT code \
FROM countries.countries \
WHERE region = 'Middle East'; \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 2/3
/*
and */
. You'll come back to it!languages
table.name
in ascending order.-- Select field
SELECT DISTINCT name
-- From languages
FROM languages
-- Order by name
ORDER BY name;
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 3/3
Now combine the previous two queries into one query:
WHERE IN
statement to the ELECT DISTINCT
query, and use the commented out query from the first instruction in there. That way, you can determine the unique languages spoken in the Middle East.Carefully review this result and its code after completing it. It serves as a great example of subqueries, which are the focus of Chapter 4.
-- Select distinct fields
SELECT DISTINCT name
-- From languages
FROM languages
-- Where in statement
WHERE code IN
-- Subquery
(SELECT code
FROM countries
WHERE region = 'Middle East')
-- Order by name
ORDER BY name;
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
WHERE code IN \
(SELECT code \
FROM countries.countries \
WHERE region = 'Middle East') \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
Let's revisit the code from the previous exercise, which retrieves languages spoken in the Middle East.
SELECT DISTINCT name
FROM languages
WHERE code IN
(SELECT code
FROM countries
WHERE region = 'Middle East')
ORDER BY name;
Sometimes problems solved with semi-joins can also be solved using an inner join.
SELECT languages.name AS language
FROM languages
INNER JOIN countries
ON languages.code = countries.code
WHERE region = 'Middle East'
ORDER BY language;
This inner join isn't quite right. What is missing from this second code block to get it to match with the correct answer produced by the first block?
Possible Answers
HAVING
instead of WHERE
DISTINCT
UNIQUE
sql_stmt = "\
SELECT DISTINCT languages.name AS language \
FROM countries.languages \
INNER JOIN countries.countries \
ON languages.code = countries.code \
WHERE region = 'Middle East' \
ORDER BY language; \
"
pd.read_sql(sql_stmt, conn)
Another powerful join in SQL is the anti-join. It is particularly useful in identifying which records are causing an incorrect number of records to appear in join queries.
You will also see another example of a subquery here, as you saw in the first exercise on semi-joins. Your goal is to identify the currencies used in Oceanian countries!
Instructions 1/3
countries
that are listed in Oceania using SELECT
, FROM
, and WHERE
.-- Select statement
SELECT count(name)
-- From countries
FROM countries
-- Where continent is Oceania
WHERE continent = 'Oceania';
sql_stmt = "\
SELECT count(name) \
FROM countries.countries \
WHERE continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2/3
countries AS c1
on the left and currencies AS c2
on the right to get the different currencies used in the countries of Oceania.ON
the code
field in the two tables.code
, country name
, and basic_unit AS currency
.Observe query result and make note of how many different countries are listed here.
-- 5. Select fields (with aliases)
SELECT c1.code, c1.name, c2.basic_unit as currency
-- 1. From countries (alias as c1)
FROM countries as c1
-- 2. Join with currencies (alias as c2)
INNER JOIN currencies as c2
-- 3. Match on code
ON c1.code = c2.code
-- 4. Where continent is Oceania
WHERE continent = 'Oceania';
sql_stmt = "\
SELECT c1.code, c1.name, c2.basic_unit as currency \
FROM countries.countries as c1 \
INNER JOIN countries.currencies as c2 \
ON c1.code = c2.code \
WHERE continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
Instructions 3/3
Note that not all countries in Oceania were listed in the resulting inner join with currencies. Use an anti-join to determine which countries were not included!
NOT IN
and (SELECT code FROM currencies)
as a subquery to get the country code and country name for the Oceanian countries that are not included in the currencies
table.-- 3. Select fields
SELECT code, name
-- 4. From Countries
FROM countries
-- 5. Where continent is Oceania
WHERE continent = 'Oceania'
-- 1. And code not in
AND code NOT IN
-- 2. Subquery
(SELECT code
FROM currencies);
sql_stmt = "\
SELECT code, name \
FROM countries.countries \
WHERE continent = 'Oceania' \
AND code NOT IN \
(SELECT code \
FROM countries.currencies); \
"
pd.read_sql(sql_stmt, conn)
Congratulations! You've now made your way to the challenge problem for this third chapter. Your task here will be to incorporate two of UNION
/UNION ALL
/INTERSECT
/EXCEPT
to solve a challenge involving three tables.
In addition, you will use a subquery as you have in the last two exercises! This will be great practice as you hop into subqueries more in Chapter 4!
Instructions
economies
or currencies
but not in populations
.-- Select the city name
SELECT country_code, name
-- Alias the table where city name resides
FROM cities AS c1
-- Choose only records matching the result of multiple set theory clauses
WHERE country_code IN
(
-- Select appropriate field from economies AS e
SELECT e.code
FROM economies AS e
-- Get all additional (unique) values of the field from currencies AS c2
UNION ALL
SELECT c2.code
FROM currencies AS c2
-- Exclude those appearing in populations AS p
EXCEPT
SELECT p.country_code
FROM populations AS p
);
sql_stmt = "\
SELECT country_code, name \
FROM countries.cities AS c1 \
WHERE country_code IN \
( SELECT e.code \
FROM countries.economies AS e \
UNION ALL \
SELECT c2.code \
FROM countries.currencies AS c2 \
EXCEPT \
SELECT p.country_code \
FROM countries.populations AS p); \
"
pd.read_sql(sql_stmt, conn)
In this closing chapter, you'll learn how to use nested queries to add some finesse to your data insights. You'll also wrap all of the content covered throughout this course into solving three challenge problems.
WHERE
statement.leaders.states
SELECT name, indep_year, fert_rate, women_parli_perc
FROM leaders.states;
Average fert_rate
SELECT AVG(fert_rate)
FROM leaders.states;
Asian countries below average fert_rate
SELECT name, fert_rate
FROM leaders.states
WHERE continent = 'Asia'
AND fert_rate <
(SELECT AVG(fert_rate)
FROM leaders.states);
sql_stmt = "\
SELECT name, fert_rate \
FROM leaders.states \
WHERE continent = 'Asia' \
AND fert_rate < \
(SELECT AVG(fert_rate) \
FROM leaders.states); \
"
pd.read_sql(sql_stmt, conn)
SELECT
clause.Count the number of countries listed in states table for each continent in the prime_ministers
table.
Continents in the prime_ministers
table
SELECT DISTINCT continent
FROM prime_ministers;
Determine the counts of the number of countries in states for each of the continents in the last slide
SELECT count(name)
FROM leaders.states
WHERE continent IN
(SELECT DISTINCT continent
FROM leaders.prime_ministers);
--Returns total count of countries
SELECT DISTINCT continent,
(SELECT COUNT(*)
FROM leaders.states
WHERE prime_ministers.continent = states.continent) as countries_num
FROM leaders.prime_ministers
prime_ministers
table in the main query.SELECT
statement, you need to give the subquery an alias (e.g. countries_num
in the example)sql_stmt = "\
SELECT DISTINCT continent, \
(SELECT COUNT(*) \
FROM leaders.states \
WHERE prime_ministers.continent = states.continent) as countries_num \
FROM leaders.prime_ministers \
"
pd.read_sql(sql_stmt, conn)
You'll now try to figure out which countries had high average life expectancies (at the country level) in 2015.
Instructions 1/2
-- Select average life_expectancy
SELECT avg(life_expectancy)
-- From populations
FROM populations
-- Where year is 2015
WHERE year = 2015;
sql_stmt = "\
SELECT avg(life_expectancy) \
FROM countries.populations \
WHERE year = 2015; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2/2
1.15 * 100
in terms of life expectancy for 2015:SELECT *
FROM populations
WHERE life_expectancy > 1.15 * 100
AND year = 2015;
populations
with records corresponding to larger than 1.15 times the average you calculated in the first task for 2015. In other words, change the 100
in the example above with a subquery.-- Select fields
SELECT *
-- From populations
FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
-- 1.15 * subquery
1.15 *
(SELECT avg(life_expectancy)
FROM populations
WHERE year = 2015)
AND year = 2015;
sql_stmt = "\
SELECT * \
FROM countries.populations \
WHERE life_expectancy > \
1.15 * \
(SELECT avg(life_expectancy) \
FROM countries.populations \
WHERE year = 2015) \
AND year = 2015; \
"
pd.read_sql(sql_stmt, conn)
Use your knowledge of subqueries in WHERE
to get the urban area population for only capital cities.
Instructions
capital
field in the countries
table in your subquery.-- 2. Select fields
SELECT name, country_code, urbanarea_pop
-- 3. From cities
FROM cities
-- 4. Where city name in the field of capital cities
WHERE name IN
-- 1. Subquery
(SELECT capital
FROM countries)
ORDER BY urbanarea_pop DESC;
sql_stmt = "\
SELECT name, country_code, urbanarea_pop \
FROM countries.cities \
WHERE name IN \
(SELECT capital \
FROM countries.countries) \
ORDER BY urbanarea_pop DESC; \
"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll see how some queries can be written using either a join or a subquery.
You have seen previously how to use GROUP BY
with aggregate functions and an inner join to get summarized information from multiple tables.
The code given in query.sql selects the top nine countries in terms of number of cities appearing in the cities
table. Recall that this corresponds to the most populous cities in the world. Your task will be to convert the commented out code to get the same result as the code shown.
Instructions 1/2
SELECT countries.name AS country, COUNT(*) AS cities_num
FROM cities
INNER JOIN countries
ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
sql_stmt = "\
SELECT countries.name AS country, COUNT(*) AS cities_num \
FROM countries.cities \
INNER JOIN countries.countries \
ON countries.code = cities.country_code \
GROUP BY country \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2/2
GROUP BY
code to use a subquery inside of SELECT
, i.e. fill in the blanks to get a result that matches the one given using the GROUP BY
code in the first query.cities_num
descending and then by country
ascending.SELECT countries.name AS country,
(SELECT count(*)
FROM cities
WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;
sql_stmt = "\
SELECT countries.name AS country, \
(SELECT count(*) \
FROM countries.cities \
WHERE countries.code = cities.country_code) AS cities_num \
FROM countries.countries \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
FROM
clause.
Determine the maximum percentage of women in parliament for each continent listing in leaders.states
SELECT continent, MAX(women_parli_perc) AS max_perc
FROM states
GROUP BY continent
ORDER BY continent;
continent
is included as one of th fields in the SELECT
clause, since we are grouping based on that field.sql_stmt = "\
SELECT continent, MAX(women_parli_perc) AS max_perc \
FROM leaders.states \
GROUP BY continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
Focusing on records in monarchs
FROM
clause, by adding a comma between themSELECT monarchs.continent
FROM monarchs, states
WHERE monarchs.continent = states.continent
ORDER BY continent;
sql_stmt = "\
SELECT monarchs.continent \
FROM leaders.monarchs, leaders.states \
WHERE monarchs.continent = states.continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
Finishing the subquery
DISTINCT
in the SELECT
statement.SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
(SELECT continent, MAX(women_parli_perc) AS max_perc
FROM states
GROUP BY continent) AS subquery
WHERE monarchs.continent = subquery.continent
ORDER BY continent;
max_perc
column included with continent?FROM
clause, include the subquery instead and alias it with a name like subquery
.FROM
clause.sql_stmt = "\
SELECT DISTINCT monarchs.continent, subquery.max_perc \
FROM leaders.monarchs, \
(SELECT continent, MAX(women_parli_perc) AS max_perc \
FROM leaders.states \
GROUP BY continent) AS subquery \
WHERE monarchs.continent = subquery.continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
The last type of subquery you will work with is one inside of FROM
.
You will use this to determine the number of languages spoken for each country, identified by the country's local name! (Note this may be different than the name
field and is stored in the local_name
field.)
Instructions 1/2
Begin by determining for each country code how many languages are listed in the languages
table using SELECT
, FROM
, and GROUP BY
.
Alias the aggregated field as lang_num
.
-- Select fields (with aliases)
SELECT code, count(name) AS lang_num
-- From languages
FROM languages
-- Group by code
GROUP BY code;
sql_stmt = "\
SELECT code, count(name) AS lang_num \
FROM countries.languages \
GROUP BY code \
ORDER BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
Instructions 2/2
subquery
) as a subquery in the FROM
clause of a new query.countries
.lang_num
from subquery
.WHERE
appropriately to match code
in countries
and in subquery
.lang_num
in descending order.-- Select fields
SELECT local_name, subquery.lang_num
-- From countries
FROM countries,
-- Subquery (alias as subquery)
(SELECT code, count(name) as lang_num
FROM languages
GROUP BY code) AS subquery
-- Where codes match
WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
sql_stmt = "\
SELECT local_name, subquery.lang_num \
FROM countries.countries, \
(SELECT code, count(name) as lang_num \
FROM countries.languages \
GROUP BY code) AS subquery \
WHERE countries.code = subquery.code \
ORDER BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
You can also nest multiple subqueries to answer even more specific questions.
In this exercise, for each of the six continents listed in 2015, you'll identify which country had the maximum inflation rate (and how high it was) using multiple subqueries. The table result of your query in Task 3 should look something like the following, where anything between <
>
will be filled in with appropriate values:
+------------+---------------+-------------------+
| name | continent | inflation_rate |
|------------+---------------+-------------------|
| <country1> | North America | <max_inflation1> |
| <country2> | Africa | <max_inflation2> |
| <country3> | Oceania | <max_inflation3> |
| <country4> | Europe | <max_inflation4> |
| <country5> | South America | <max_inflation5> |
| <country6> | Asia | <max_inflation6> |
+------------+---------------+-------------------+
Again, there are multiple ways to get to this solution using only joins, but the focus here is on showing you an introduction into advanced subqueries.
Instructions 1/3
countries
on the left and economies
on the right with USING
. Do not alias your tables or columns.-- Select fields
SELECT name, continent, inflation_rate
-- From countries
FROM countries
-- Join to economies
INNER JOIN economies
-- Match on code
USING (code)
-- Where year is 2015
WHERE year = 2015;
sql_stmt = "\
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
USING (code) \
WHERE year = 2015; \
"
inf_rate = pd.read_sql(sql_stmt, conn)
print(inf_rate.head())
print(inf_rate.tail())
Instructions 2/3
subquery
in the FROM
clause.AS max_inf
grouped by continent.continent
in the outer SELECT
statement.)-- Select fields
SELECT max(inflation_rate) as max_inf
-- Subquery using FROM (alias as subquery)
FROM (
SELECT name, continent, inflation_rate
FROM countries
INNER JOIN economies
USING (code)
WHERE year = 2015) AS subquery
-- Group by continent
GROUP BY continent;
sql_stmt = "\
SELECT max(inflation_rate) as max_inf \
FROM ( \
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
USING (code) \
WHERE year = 2015) AS subquery \
GROUP BY continent; \
"
pd.read_sql(sql_stmt, conn)
Instructions 3/3
WHERE
, AND
, and IN
to obtain the name of the country, its continent, and the maximum inflation rate for each continent in 2015. Revisit the sample output in the assignment text at the beginning of the exercise to see how this matches up.ON
instead of USING
.-- Select fields
SELECT name, continent, inflation_rate
-- From countries
FROM countries
-- Join to economies
INNER JOIN economies
-- Match on code
ON countries.code = economies.code
-- Where year is 2015
WHERE year = 2015
-- And inflation rate in subquery (alias as subquery)
AND inflation_rate IN (
SELECT max(inflation_rate) as max_inf
FROM (
SELECT name, continent, inflation_rate
FROM countries
INNER JOIN economies
ON countries.code = economies.code
WHERE year = 2015) AS subquery
GROUP BY continent);
sql_stmt = "\
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
ON countries.code = economies.code \
WHERE year = 2015 \
AND inflation_rate IN ( \
SELECT max(inflation_rate) as max_inf \
FROM ( \
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
ON countries.code = economies.code \
WHERE year = 2015) AS subquery \
GROUP BY continent); \
"
pd.read_sql(sql_stmt, conn)
Let's test your understanding of the subqueries with a challenge problem! Use a subquery to get 2015 economic data for countries that do not have
gov_form
of 'Constitutional Monarchy'
or'Republic'
in their gov_form
.Here, gov_form
stands for the form of the government for each country. Review the different entries for gov_form
in the countries
table.
Instructions
-- Select fields
SELECT code, inflation_rate, unemployment_rate
-- From economies
FROM economies
-- Where year is 2015 and code is not in
WHERE year = 2015 AND code NOT IN
-- Subquery
(SELECT code
FROM countries
WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic'))
-- Order by inflation rate
ORDER BY inflation_rate;
sql_stmt = "\
SELECT code, inflation_rate, unemployment_rate \
FROM countries.economies \
WHERE year = 2015 AND code NOT IN \
(SELECT code \
FROM countries.countries \
WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%%Republic')) \
ORDER BY inflation_rate; \
"
pd.read_sql(sql_stmt, conn)
Within which SQL clause are subqueries most frequently found?
Answer the question
Types of joins:
Notes
CROSS JOIN
does not have an ON
or USING
clause, but otherwise looks very similar to the code for an INNER JOIN
, LEFT JOIN
, RIGHT JOIN
, or FULL JOIN
.Set Theory Clauses
UNION
includes every record in both tables but DOES NOT double count those that are in both tables.UNION ALL
does replicate those that are in both tables.INTERSECT
gives only those records found in both of the two tables.EXCEPT
gives only those records in one table but not the other.Semi-joins and Anti-joins
Types of basic subqueries
WHERE
clause.SELECT
clauses and inside FROM
clauses.WHERE
clauses too.Welcome to the end of the course! The next three exercises will test your knowledge of the content covered in this course and apply many of the ideas you've seen to difficult problems. Good luck!
Read carefully over the instructions and solve them step-by-step, thinking about how the different clauses work together.
In this exercise, you'll need to get the country names and other 2015 data in the economies
table and the countries
table for Central American countries with an official language.
Instructions
countries
on the left. (An inner join would also work, but please use a left join here.)code
in the two tables AND
use a subquery inside of ON
to choose the appropriate languages
records.-- Select fields
SELECT DISTINCT c.name, e.total_investment, e.imports
-- From table (with alias)
FROM countries AS c
-- Join with table (with alias)
LEFT JOIN economies AS e
-- Match on code
ON (c.code = e.code
-- and code in Subquery
AND c.code IN (
SELECT l.code
FROM languages AS l
WHERE official = 'true'
) )
-- Where region and year are correct
WHERE year = 2015 AND region = 'Central America'
-- Order by field
ORDER BY name;
sql_stmt = "\
SELECT DISTINCT c.name, e.total_investment, e.imports \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON (c.code = e.code \
AND c.code IN ( \
SELECT l.code \
FROM countries.languages AS l \
WHERE official = 'true' \
) ) \
WHERE year = 2015 AND region = 'Central America' \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
Whoofta! That was challenging, huh?
Let's ease up a bit and calculate the average fertility rate for each region in 2015.
Instructions
avg_fert_rate
.avg_fert_rate
ascending.GROUP BY
all fields that aren't included in the aggregate function of SELECT
.-- Select fields
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate
-- From left table
FROM populations AS p
-- Join to right table
INNER JOIN countries AS c
-- Match on join condition
ON p.country_code = c.code
-- Where specific records matching some condition
WHERE year = 2015
-- Group appropriately
GROUP BY c.continent, c.region
-- Order appropriately
ORDER BY avg_fert_rate;
sql_stmt = "\
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate \
FROM countries.populations AS p \
INNER JOIN countries.countries AS c \
ON p.country_code = c.code \
WHERE year = 2015 \
GROUP BY c.continent, c.region \
ORDER BY avg_fert_rate; \
"
pd.read_sql(sql_stmt, conn)
Welcome to the last challenge problem. By now you're a query warrior! Remember that these challenges are designed to take you to the limit to solidify your SQL knowledge! Take a deep breath and solve this step-by-step.
You are now tasked with determining the top 10 capital cities in Europe and the Americas in terms of a calculated percentage using city_proper_pop
and metroarea_pop
in cities
.
Do not use table aliasing in this exercise.
Instructions
cities
, aliased as city_perc
.city_perc
descending.city_perc
percentage.-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,
-- Calculate city_perc
city_proper_pop / metroarea_pop * 100 AS city_perc
-- From appropriate table
FROM cities
-- Where
WHERE name IN
-- Subquery
(SELECT capital
FROM countries
WHERE (continent = 'Europe'
OR continent LIKE '%America'))
AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc desc
-- Limit amount
LIMIT 10;
sql_stmt = "\
SELECT name, country_code, city_proper_pop, metroarea_pop, \
city_proper_pop / metroarea_pop * 100 AS city_perc \
FROM countries.cities \
WHERE name IN \
(SELECT capital \
FROM countries.countries \
WHERE (continent = 'Europe' \
OR continent LIKE '%%America')) \
AND metroarea_pop IS NOT NULL \
ORDER BY city_perc desc \
LIMIT 10; \
"
pd.read_sql(sql_stmt, conn)
conn.close()