Joining Data in SQL

Posted Nov 21, 2019 Updated Apr 27, 2024

By Trenton McKinney 115 min read

Course: DataCamp: Joining Data in SQL
This notebook was created as a reproducible reference.
The material is from the course
I completed the exercises
I installed and configured a PostgreSQL database to run all the queries locally.
If you find the content beneficial, consider a DataCamp Subscription.

Course Description

Now that you’ve learned the basics of SQL in our Intro to SQL for Data Science course, it’s time to supercharge your queries using joins and relational set theory! In this course you’ll learn all about the power of joining tables while exploring interesting features of countries and their cities throughout the world. You will master inner and outer joins, as well as self-joins, semi-joins, anti-joins and cross joins - fundamental tools in any PostgreSQL wizard’s toolbox. You’ll fear set theory no more, after learning all about unions, intersections, and except clauses through easy-to-understand diagrams and examples. Lastly, you’ll be introduced to the challenging topic of subqueries. You will see a visual perspective to grasp the ideas throughout the course using the mediums of Venn diagrams and other linking illustrations.

Datasets

Imports

  
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, inspect, text

import pandas as pd
from pprint import pprint as pp

Pandas Configuration Options

  
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

PostgreSQL Connection

In order to run this Notebook, install, setup and configure a PostgreSQL database with the previously mentioned datasets.
Edit engine to use your database username and password.

  
engine = create_engine('postgresql://postgres:postgres@localhost/postgres')

Specify a default schema that should be used for tables defined within this MetaData instance. This means that when you define a new Table with this MetaData instance, SQLAlchemy knows to associate that table with the countries schema. However, this association affects only those operations and objects directly utilizing the MetaData instance for defining tables or reflecting database schema.

  
meta = MetaData(schema="countries")

  
conn = engine.connect()

# conn.close()

Example(s) without pd.DataFrames - use fetchall

For raw SQL queries executed through a connection (conn.execute), the MetaData’s schema setting doesn’t automatically apply. The SQL execution context is lower-level and doesn’t use the MetaData schema setting for its operations. Therefore, when executing raw SQL statements, you must explicitly set the schema in your queries, as mentioned previously, or adjust the search path for the session.

Data Model Overview

Tables and Primary Keys

Countries
- Primary key: code
- Fields: name, continent, region, surface_area, indep_year, local_name, gov_form, capital, cap_long, cap_lat
- Relations: Potentially linked to other tables through the code field.
Cities
- Primary key: name
- Fields: country_code, city_proper_pop, metroarea_pop, urbanarea_pop
- Relations: Linked to Countries through country_code.
Languages
- Primary key: lang_id
- Fields: code, name, percent, official
- Relations: Possibly linked to Countries through code.
Economies (general, 2010, 2015 versions)
- Primary key: code for each version
- Fields: year, income_group, gdp_percapita (and other economic indicators depending on the table)
- Relations: Linked to Countries through code.
Populations
- Primary key: pop_id
- Fields: country_code, year, fertility_rate, life_expectancy, size
- Relations: Linked to Countries through country_code.
Currencies
- Primary key: curr_id
- Fields: code, basic_unit, curr_code, frac_unit, frac_perbasic
- Relations: Potentially linked to Countries through code.
Countries Plus
- Primary key: code
- Fields: name, continent, surface_area, geosize_group
- Relations: An alternative or additional representation of Countries.

Data Integrity and Relationships

Even though the SQL file lacks explicit foreign key constraints, you can maintain data integrity through application logic or by adding these constraints yourself if you are managing the database schema. These constraints would ensure that all references between tables are valid and consistent, preventing orphan records and maintaining referential integrity.

  
result = conn.execute(text("SELECT * FROM countries.cities"))

  
rows = result.fetchall()

  
[x for x in rows]

[('Abidjan', 'CIV', 4765000.0, None, 4765000.0),
 ('Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0),
 ('Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0),
 ('Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0),
 ('Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0),
 ('Ahmedabad', 'IND', 5570585.0, None, 5570585.0),
 ('Alexandria', 'EGY', 4616625.0, None, 4616625.0),
 ('Algiers', 'DZA', 3415811.0, 5000000.0, 3415811.0),
 ('Almaty', 'KAZ', 1703481.0, None, 1703481.0),
 ('Ankara', 'TUR', 5271000.0, 4585000.0, 5271000.0),
 ('Auckland', 'NZL', 1495000.0, 1614300.0, 1495000.0),
 ('Baghdad', 'IRQ', 7180889.0, None, 7180889.0),
 ('Baku', 'AZE', 3202300.0, 4308740.0, 3202300.0),
 ('Bandung', 'IDN', 2575478.0, 6965655.0, 2575478.0),
 ('Bangkok', 'THA', 8280925.0, 14998000.0, 8280925.0),
 ('Barcelona', 'ESP', 1604555.0, 5375774.0, 1604555.0),
 ('Barranquilla', 'COL', 1386865.0, 2370753.0, 1386865.0),
 ('Basra', 'IRQ', 2750000.0, None, 2750000.0),
 ('Beijing', 'CHN', 21516000.0, 24900000.0, 21516000.0),
 ('Belo Horizonte', 'BRA', 2502557.0, 5156217.0, 2502557.0),
 ('Bengaluru', 'IND', 8425970.0, 9807000.0, 8425970.0),
 ('Berlin', 'DEU', 3517424.0, 5871022.0, 3517424.0),
 ('Bhopal', 'IND', 1798218.0, 1864389.0, 1798218.0),
 ('Birmingham', 'GBR', 1111300.0, 3683000.0, 1111300.0),
 ('Bogota', 'COL', 7878783.0, 9800000.0, 7878783.0),
 ('Brasilia', 'BRA', 2556149.0, 3919864.0, 2556149.0),
 ('Brazzaville', 'COG', 1827000.0, None, 1827000.0),
 ('Brisbane', 'AUS', 1180285.0, 2349699.0, 1180285.0),
 ('Bucharest', 'ROM', 1883425.0, 2272163.0, 1883425.0),
 ('Budapest', 'HUN', 1759407.0, 2927944.0, 1759407.0),
 ('Buenos Aires', 'ARG', 3054300.0, 14122000.0, 3054300.0),
 ('Busan', 'KOR', 3510833.0, 8202239.0, 3510833.0),
 ('Cairo', 'EGY', 10230350.0, 18290000.0, 10230350.0),
 ('Calgary', 'CAN', 1235171.0, 1214839.0, 1235171.0),
 ('Cali', 'COL', 2400653.0, 3400000.0, 2400653.0),
 ('Caloocan', 'PHL', 1583978.0, None, 1583978.0),
 ('Campinas', 'BRA', 1164098.0, 3094181.0, 1164098.0),
 ('Cape Town', 'ZAF', 3740026.0, None, 3740026.0),
 ('Caracas', 'VEN', 1943901.0, 2923959.0, 1943901.0),
 ('Casablanca', 'MAR', 5117832.0, 6861739.0, 5117832.0),
 ('Changchun', 'CHN', 3815270.0, 7674439.0, 3815270.0),
 ('Changsha', 'CHN', 7044118.0, None, 7044118.0),
 ('Chaozhou', 'CHN', 2669844.0, None, 2669844.0),
 ('Chengdu', 'CHN', 4741929.0, 10376000.0, 4741929.0),
 ('Chennai', 'IND', 7088000.0, None, 7088000.0),
 ('Chicago', 'USA', 2695598.0, 9156000.0, 2695598.0),
 ('Chittagong', 'BGD', 2581643.0, 4009423.0, 2581643.0),
 ('Chongqing', 'CHN', 8189800.0, 52100100.0, 8189800.0),
 ('Cologne', 'DEU', 1057327.0, 3573500.0, 1057327.0),
 ('Cordoba', 'ARG', 1330023.0, 1528000.0, 1330023.0),
 ('Curitiba', 'BRA', 1879355.0, 3400000.0, 1879355.0),
 ('Daegu', 'KOR', 2492994.0, None, 2492994.0),
 ('Daejeon', 'KOR', 1535028.0, None, 1535028.0),
 ('Dakar', 'SEN', 1146053.0, 2452656.0, 1146053.0),
 ('Dalian', 'CHN', 2146099.0, 5935638.0, 2146099.0),
 ('Dallas', 'USA', 1317929.0, 7233323.0, 1317929.0),
 ('Dar es Salaam', 'TZA', 4364541.0, None, 4364541.0),
 ('Davao City', 'PHL', 1632991.0, 2516216.0, 1632991.0),
 ('Delhi', 'IND', 16787940.0, 24998000.0, 16787940.0),
 ('Dhaka', 'BGD', 14543124.0, None, 14543124.0),
 ('Dongguan', 'CHN', 8220207.0, None, 8220207.0),
 ('Douala', 'CMR', 2446945.0, None, 2446945.0),
 ('Dubai', 'ARE', 2643410.0, None, 2643410.0),
 ('Durban', 'ZAF', 3442361.0, None, 3442361.0),
 ('Ekurhuleni', 'ZAF', 3178470.0, None, 3178470.0),
 ('Faisalabad', 'PAK', 6480765.0, 3675000.0, 6480765.0),
 ('Fez', 'MAR', 1112072.0, None, 1112072.0),
 ('Fortaleza', 'BRA', 2609716.0, 4019213.0, 2609716.0),
 ('Foshan', 'CHN', 6151622.0, None, 6151622.0),
 ('Fukuoka', 'JPN', 1483052.0, 5590378.0, 1483052.0),
 ('Fuzhou', 'CHN', 7115369.0, None, 7115369.0),
 ('Giza', 'EGY', 4239988.0, None, 4239988.0),
 ('Guadalajara', 'MEX', 1495189.0, 4424252.0, 1495189.0),
 ('Guangzhou', 'CHN', 14043500.0, 44259000.0, 14043500.0),
 ('Guatemala City', 'GTM', 2110100.0, 4500000.0, 2110100.0),
 ('Guayaquil', 'ECU', 3600000.0, 5000000.0, 3600000.0),
 ('Gujranwala', 'PAK', 2700003.0, None, 2700003.0),
 ('Hamburg', 'DEU', 1787408.0, None, 1787408.0),
 ('Hangzhou', 'CHN', 3560391.0, None, 3560391.0),
 ('Hanoi', 'VNM', 6844100.0, None, 6844100.0),
 ('Harare', 'ZWE', 1606000.0, None, 1606000.0),
 ('Harbin', 'CHN', 4280701.0, 10635971.0, 4280701.0),
 ('Havana', 'CUB', 2106146.0, None, 2106146.0),
 ('Hefei', 'CHN', 3352076.0, None, 3352076.0),
 ('Hiroshima', 'JPN', 1196274.0, None, 1196274.0),
 ('Ho Chi Minh City', 'VNM', 7681700.0, None, 7681700.0),
 ('Hong Kong', 'CHN', 7374900.0, None, 7374900.0),
 ('Houston', 'USA', 2489558.0, 6490180.0, 2489558.0),
 ('Hyderabad (India)', 'IND', 7859250.0, None, 7859250.0),
 ('Hyderabad', 'PAK', 3429471.0, None, 3429471.0),
 ('Ibadan', 'NGA', 1338659.0, 2837000.0, 1338659.0),
 ('Incheon', 'KOR', 2978367.0, None, 2978367.0),
 ('Isfahan', 'IRN', 2243249.0, None, 2243249.0),
 ('Islamabad', 'PAK', 1900000.0, 2200000.0, 1900000.0),
 ('Istanbul', 'TUR', 14025000.0, 13520000.0, 14025000.0),
 ('Izmir', 'TUR', 4168000.0, 3019000.0, 4168000.0),
 ('Jaipur', 'IND', 3073350.0, None, 3073350.0),
 ('Jakarta', 'IDN', 10075310.0, 30539000.0, 10075310.0),
 ('Jeddah', 'SAU', 3456259.0, None, 3456259.0),
 ('Jinan', 'CHN', 2009273.0, 5853196.0, 2009273.0),
 ('Johannesburg', 'ZAF', 4434827.0, None, 4434827.0),
 ('Kabul', 'AFG', 3414100.0, None, 3414100.0),
 ('Kampala', 'UGA', 1507080.0, None, 1507080.0),
 ('Kano', 'NGA', 2153225.0, 3395000.0, 2153225.0),
 ('Kanpur', 'IND', 2768057.0, 3152317.0, 2768057.0),
 ('Kaohsiung', 'TWN', 2778918.0, None, 2778918.0),
 ('Karachi', 'PAK', 27506000.0, 25400000.0, 27506000.0),
 ('Karaj', 'IRN', 1973470.0, None, 1973470.0),
 ('Kawasaki', 'JPN', 1496035.0, None, 1496035.0),
 ('Kharkov', 'UKR', 1439566.0, 1650000.0, 1439566.0),
 ('Khartoum', 'SDN', 3639598.0, 5274321.0, 3639598.0),
 ('Kiev', 'UKR', 2908703.0, None, 2908703.0),
 ('Kinshasa', 'COD', 10130000.0, 13265000.0, 10130000.0),
 ('Kobe', 'JPN', 1536499.0, None, 1536499.0),
 ('Kochi', 'IND', 2232456.0, 4221140.0, 2232456.0),
 ('Kolkata', 'IND', 4486679.0, 14667000.0, 4486679.0),
 ('Kuala Lumpur', 'MYS', 1768000.0, 7200000.0, 1768000.0),
 ('Kwangju', 'KOR', 1477780.0, None, 1477780.0),
 ('Kyoto', 'JPN', 1474570.0, None, 1474570.0),
 ('Lagos', 'NGA', 16060303.0, 21000000.0, 16060303.0),
 ('Lahore', 'PAK', 10355000.0, 13569000.0, 10355000.0),
 ('Lanzhou', 'CHN', 2177130.0, 3616163.0, 2177130.0),
 ('Lima', 'PER', 8852000.0, 10750000.0, 8852000.0),
 ('London', 'GBR', 8673713.0, 13879757.0, 8673713.0),
 ('Los Angeles', 'USA', 3884307.0, 15058000.0, 3884307.0),
 ('Luanda', 'AGO', 2825311.0, None, 2825311.0),
 ('Lucknow', 'IND', 2815601.0, None, 2815601.0),
 ('Lusaka', 'ZMB', 1742979.0, 2467467.0, 1742979.0),
 ('Madrid', 'ESP', 3207247.0, None, 3207247.0),
 ('Makassar', 'IDN', 1338633.0, 1976168.0, 1338633.0),
 ('Managua', 'NIC', 2560789.0, None, 2560789.0),
 ('Mandalay', 'MMR', 1319452.0, 1726889.0, 1319452.0),
 ('Manila', 'PHL', 1780148.0, 12877253.0, 1780148.0),
 ('Maputo', 'MOZ', 1766184.0, 1766823.0, 1766184.0),
 ('Maracaibo', 'VEN', 1599940.0, 3897655.0, 1599940.0),
 ('Mashhad', 'IRN', 3312090.0, 3372660.0, 3312090.0),
 ('Medan', 'IDN', 2097610.0, 4103696.0, 2097610.0),
 ('Medellin', 'COL', 2441123.0, 3731447.0, 2441123.0),
 ('Mexico City', 'MEX', 8974724.0, 20063000.0, 8974724.0),
 ('Milan', 'ITA', 1359905.0, 3206465.0, 1359905.0),
 ('Minsk', 'BLR', 1959781.0, None, 1959781.0),
 ('Monterrey', 'MEX', 1130960.0, 4520329.0, 1130960.0),
 ('Montevideo', 'URY', 1305082.0, 1947604.0, 1305082.0),
 ('Montreal', 'CAN', 1649519.0, 4127100.0, 1649519.0),
 ('Moscow', 'RUS', 12197596.0, 16170000.0, 12197596.0),
 ('Multan', 'PAK', 3117000.0, None, 3117000.0),
 ('Mumbai', 'IND', 12478447.0, 17712000.0, 12478447.0),
 ('Munich', 'DEU', 1450381.0, 2606021.0, 1450381.0),
 ('Nagoya', 'JPN', 2296014.0, 9107414.0, 2296014.0),
 ('Nagpur', 'IND', 2405665.0, 2497870.0, 2405665.0),
 ('Nairobi', 'KEN', 3138369.0, None, 3138369.0),
 ('Nanjing', 'CHN', 8230000.0, 34360000.0, 8230000.0),
 ('New Taipei City', 'TWN', 3954929.0, None, 3954929.0),
 ('New York City', 'USA', 8550405.0, 20182304.0, 8550405.0),
 ('Ningbo', 'CHN', 3491597.0, 7639000.0, 3491597.0),
 ('Nizhny Novgorod', 'RUS', 1250619.0, None, 1250619.0),
 ('Novosibirsk', 'RUS', 1567087.0, None, 1567087.0),
 ('Omsk', 'RUS', 1154116.0, None, 1154116.0),
 ('Oran', 'DZA', 1560329.0, 3454078.0, 1560329.0),
 ('Osaka', 'JPN', 2691742.0, 19341976.0, 2691742.0),
 ('Ouagadougou', 'BFA', 2200000.0, 2500000.0, 2200000.0),
 ('Palembang', 'IDN', 1708413.0, None, 1708413.0),
 ('Paris', 'FRA', 2229621.0, 10601122.0, 2229621.0),
 ('Patna', 'IND', 1683200.0, 2231554.0, 1683200.0),
 ('Peshawar', 'PAK', 3201000.0, None, 3201000.0),
 ('Philadelphia', 'USA', 1567872.0, 6069875.0, 1567872.0),
 ('Phnom Penh', 'KHM', 2234566.0, None, 2234566.0),
 ('Phoenix', 'USA', 1563025.0, 4574531.0, 1563025.0),
 ('Porto Alegre', 'BRA', 1476867.0, None, 1476867.0),
 ('Prague', 'CZE', 1324000.0, None, 1324000.0),
 ('Pune', 'IND', 3115431.0, None, 3115431.0),
 ('Pyongyang', 'PRK', 3255388.0, None, 3255388.0),
 ('Qingdao', 'CHN', 6188100.0, 9046200.0, 6188100.0),
 ('Quanzhou', 'CHN', 8128533.0, 6107475.0, 8128533.0),
 ('Quezon City', 'PHL', 2936116.0, None, 2936116.0),
 ('Quito', 'ECU', 2671191.0, 4700000.0, 2671191.0),
 ('Rawalpindi', 'PAK', 3198911.0, None, 3198911.0),
 ('Recife', 'BRA', 1555039.0, 3743854.0, 1555039.0),
 ('Rio de Janeiro', 'BRA', 6429923.0, 12727000.0, 6429923.0),
 ('Riyadh', 'SAU', 5676621.0, None, 5676621.0),
 ('Rome', 'ITA', 2877215.0, 4353775.0, 2877215.0),
 ('Rosario', 'ARG', 1193605.0, 1276000.0, 1193605.0),
 ('Rostov-on-Don', 'RUS', 1119900.0, None, 1119900.0),
 ('Saint Petersburg', 'RUS', 5191690.0, 5900000.0, 5191690.0),
 ('Saitama', 'JPN', 1226656.0, None, 1226656.0),
 ('Salvador', 'BRA', 2902927.0, 3919864.0, 2902927.0),
 ('San Antonio', 'USA', 1469845.0, 2454061.0, 1469845.0),
 ('San Diego', 'USA', 1394928.0, 3095313.0, 1394928.0),
 ("Sana'a", 'YEM', 1937451.0, 2167961.0, 1937451.0),
 ('Santa Cruz de la Sierra', 'BOL', 1453549.0, 1749000.0, 1453549.0),
 ('Santiago', 'CHL', 5743719.0, None, 5743719.0),
 ('Sao Paulo', 'BRA', 12038175.0, 21090792.0, 12038175.0),
 ('Sapporo', 'JPN', 1918096.0, 2584880.0, 1918096.0),
 ('Semarang', 'IDN', 1555984.0, 3183516.0, 1555984.0),
 ('Seoul', 'KOR', 9995784.0, 12700000.0, 9995784.0),
 ('Shanghai', 'CHN', 24256800.0, 34750000.0, 24256800.0),
 ('Shantou', 'CHN', 5391028.0, 11535677.0, 5391028.0),
 ('Shenyang', 'CHN', 8106171.0, None, 8106171.0),
 ('Shenzhen', 'CHN', 10778900.0, None, 10778900.0),
 ('Shijiazhuang', 'CHN', 4303700.0, 10701600.0, 4303700.0),
 ('Shiraz', 'IRN', 1869001.0, None, 1869001.0),
 ('Singapore', 'SGP', 5535000.0, None, 5535000.0),
 ('Surabaya', 'IDN', 2765487.0, 7302283.0, 2765487.0),
 ('Surat', 'IND', 4462002.0, None, 4462002.0),
 ('Suzhou', 'CHN', 10650501.0, None, 10650501.0),
 ("T'bilisi", 'GEO', 1118035.0, 1485293.0, 1118035.0),
 ('Tabriz', 'IRN', 1733033.0, None, 1733033.0),
 ('Taichung', 'TWN', 2752413.0, None, 2752413.0),
 ('Tainan', 'TWN', 1885252.0, None, 1885252.0),
 ('Taipei', 'TWN', 2704974.0, None, 2704974.0),
 ('Tangshan', 'CHN', 3187171.0, 7536521.0, 3187171.0),
 ('Tashkent', 'UZB', 2309600.0, None, 2309600.0),
 ('Tehran', 'IRN', 8154051.0, 13532000.0, 8154051.0),
 ('Tianjin', 'CHN', 6859779.0, 15469500.0, 6859779.0),
 ('Tijuana', 'MEX', 1696923.0, 1895797.0, 1696923.0),
 ('Tokyo', 'JPN', 13513734.0, 37843000.0, 13513734.0),
 ('Toronto', 'CAN', 2731571.0, 6417516.0, 2731571.0),
 ('Tripoli', 'LBY', 1126000.0, 2267000.0, 1126000.0),
 ('Tunis', 'TUN', 1056247.0, 2643695.0, 1056247.0),
 ('Ulsan', 'KOR', 1163690.0, None, 1163690.0),
 ('Vienna', 'AUT', 1863881.0, 2600000.0, 1863881.0),
 ('Vijayawada', 'IND', 1491202.0, None, 1491202.0),
 ('Visakhapatnam', 'IND', 2035922.0, 5340000.0, 2035922.0),
 ('Warsaw', 'POL', 1753977.0, 3100844.0, 1753977.0),
 ('Wenzhou', 'CHN', 3039439.0, None, 3039439.0),
 ('Wuhan', 'CHN', 6886253.0, None, 6886253.0),
 ("Xi'an", 'CHN', 8705600.0, 13569700.0, 8705600.0),
 ('Xiamen', 'CHN', 3531347.0, 5114758.0, 3531347.0),
 ('Yangon', 'MMR', 5214000.0, None, 5214000.0),
 ('Yaounde', 'CMR', 2440462.0, None, 2440462.0),
 ('Yekaterinburg', 'RUS', 1428042.0, None, 1428042.0),
 ('Yerevan', 'ARM', 1060138.0, None, 1060138.0),
 ('Yokohama', 'JPN', 3726167.0, None, 3726167.0),
 ('Zhengzhou', 'CHN', 4122087.0, None, 4122087.0),
 ('Zhongshan', 'CHN', 3121275.0, None, 3121275.0),
 ('Zunyi', 'CHN', 6127009.0, None, 6127009.0)]

  
cities = conn.execute(text("select * \
from countries.countries \
inner join countries.cities \
on countries.cities.country_code = countries.code"))

  
cities_res = cities.fetchall()

  
cities_list = [x for i, x in enumerate(cities_res) if i < 10]

cities_list

[('CIV', "Cote d'Ivoire", 'Africa', 'Western Africa', 322463.0, 1960, 'Cote d\x92Ivoire', 'Republic', 'Yamoussoukro', -4.0305, 5.332, 'Abidjan', 'CIV', 4765000.0, None, 4765000.0),
 ('ARE', 'United Arab Emirates', 'Asia', 'Middle East', 83600.0, 1971, 'Al-Imarat al-´Arabiya al-Muttahida', 'Emirate Federation', 'Abu Dhabi', 54.3705, 24.4764, 'Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0),
 ('NGA', 'Nigeria', 'Africa', 'Western Africa', 923768.0, 1960, 'Nigeria', 'Federal Republic', 'Abuja', 7.48906, 9.05804, 'Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0),
 ('GHA', 'Ghana', 'Africa', 'Western Africa', 238533.0, 1957, 'Ghana', 'Republic', 'Accra', -0.20795, 5.57045, 'Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0),
 ('ETH', 'Ethiopia', 'Africa', 'Eastern Africa', 1104300.0, -1000, 'YeItyop´iya', 'Republic', 'Addis Ababa', 38.7468, 9.02274, 'Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0),
 ('IND', 'India', 'Asia', 'Southern and Central Asia', 3287260.0, 1947, 'Bharat/India', 'Federal Republic', 'New Delhi', 77.225, 28.6353, 'Ahmedabad', 'IND', 5570585.0, None, 5570585.0),
 ('EGY', 'Egypt', 'Africa', 'Northern Africa', 1001450.0, 1922, 'Misr', 'Republic', 'Cairo', 31.2461, 30.0982, 'Alexandria', 'EGY', 4616625.0, None, 4616625.0),
 ('DZA', 'Algeria', 'Africa', 'Northern Africa', 2381740.0, 1962, 'Al-Jaza\x92ir/Algerie', 'Republic', 'Algiers', 3.05097, 36.7397, 'Algiers', 'DZA', 3415811.0, 5000000.0, 3415811.0),
 ('KAZ', 'Kazakhstan', 'Asia', 'Southern and Central Asia', 2724900.0, 1991, 'Qazaqstan', 'Republic', 'Astana', 71.4382, 51.1879, 'Almaty', 'KAZ', 1703481.0, None, 1703481.0),
 ('TUR', 'Turkey', 'Asia', 'Middle East', 774815.0, 1923, 'Turkiye', 'Republic', 'Ankara', 32.3606, 39.7153, 'Ankara', 'TUR', 5271000.0, 4585000.0, 5271000.0)]

Introduction to joins

In this chapter, you’ll be introduced to the concept of joining tables, and explore the different ways you can enrich your queries using inner joins and self-joins. You’ll also see how to use the case statement to split up a field into different categories.

Introduction to INNER JOIN

  
cities = conn.execute(text("select * from countries.cities"))

  
cities_df = pd.read_sql("select * from countries.cities", conn)

  
cities_df.head()

	name	country_code	city_proper_pop	metroarea_pop	urbanarea_pop
0	Abidjan	CIV	4765000.0	NaN	4765000.0
1	Abu Dhabi	ARE	1145000.0	NaN	1145000.0
2	Abuja	NGA	1235880.0	6000000.0	1235880.0
3	Accra	GHA	2070463.0	4010054.0	2070463.0
4	Addis Ababa	ETH	3103673.0	4567857.0	3103673.0

  
sql_stmt = "SELECT * FROM countries.cities INNER JOIN countries.countries ON countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()

	name	country_code	city_proper_pop	metroarea_pop	urbanarea_pop	code	name	continent	region	surface_area	indep_year	local_name	gov_form	capital	cap_long	cap_lat
0	Abidjan	CIV	4765000.0	NaN	4765000.0	CIV	Cote d'Ivoire	Africa	Western Africa	322463.0	1960	Cote dIvoire	Republic	Yamoussoukro	-4.03050	5.33200
1	Abu Dhabi	ARE	1145000.0	NaN	1145000.0	ARE	United Arab Emirates	Asia	Middle East	83600.0	1971	Al-Imarat al-´Arabiya al-Muttahida	Emirate Federation	Abu Dhabi	54.37050	24.47640
2	Abuja	NGA	1235880.0	6000000.0	1235880.0	NGA	Nigeria	Africa	Western Africa	923768.0	1960	Nigeria	Federal Republic	Abuja	7.48906	9.05804
3	Accra	GHA	2070463.0	4010054.0	2070463.0	GHA	Ghana	Africa	Western Africa	238533.0	1957	Ghana	Republic	Accra	-0.20795	5.57045
4	Addis Ababa	ETH	3103673.0	4567857.0	3103673.0	ETH	Ethiopia	Africa	Eastern Africa	1104300.0	-1000	YeItyop´iya	Republic	Addis Ababa	38.74680	9.02274

  
sql_stmt = "SELECT countries.cities.name as city, countries.countries.name as country, \
countries.countries.region FROM countries.cities INNER JOIN countries.countries ON \
countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()

	city	country	region
0	Abidjan	Cote d'Ivoire	Western Africa
1	Abu Dhabi	United Arab Emirates	Middle East
2	Abuja	Nigeria	Western Africa
3	Accra	Ghana	Western Africa
4	Addis Ababa	Ethiopia	Eastern Africa

INNER JOIN via USING

  
SELECT left_table.id as L_id
       left_table.val as L_val
       right_table.val as R_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;

When the key field you’d like to join on is the same name in both tables, you can use a USING clause instead of the ON clause.

  
SELECT left_table.id as L_id
       left_table.val as L_val
       right_table.val as R_val
FROM left_table
INNER JOIN right_table
USING (id);

Countries with prime ministers and presidents

  
SELECT p1.country, p1.continent, prime_minister, president
FROM leaders.presidents AS p1
INNER JOIN leaders.prime_ministers as p2
USING (country);

  
sql_stmt = "SELECT p1.country, p1.continent, prime_minister, president \
FROM leaders.presidents AS p1 \
INNER JOIN leaders.prime_ministers as p2 \
USING (country)"

pd.read_sql(sql_stmt, conn).head()

	country	continent	prime_minister	president
0	Egypt	Africa	Sherif Ismail	Abdel Fattah el-Sisi
1	Portugal	Europe	Antonio Costa	Marcelo Rebelo de Sousa
2	Haiti	North America	Jack Guy Lafontant	Jovenel Moise
3	Vietnam	Asia	Nguyen Xuan Phuc	Tran Dai Quang

Exercises

Review inner join using on

Why does the following code result in an error?

  
SELECT c.name AS country, l.name AS language
FROM countries AS c
  INNER JOIN languages AS l;

INNER JOIN requires a specification of the key field (or fields) in each table.

Inner join with using

When joining tables with a common field name, e.g.

  
SELECT *
FROM countries
  INNER JOIN economies
    ON countries.code = economies.code

You can use USING as a shortcut:

  
SELECT *
FROM countries
  INNER JOIN economies
    USING(code)

You’ll now explore how this can be done with the countries and languages tables.

Instructions

Inner join countries on the left and languages on the right with USING(code).
Select the fields corresponding to:
- country name AS country,
- continent name,
- language name AS language, and
- whether or not the language is official.
Remember to alias your tables using the first letter of their names.

  
-- 4. Select fields
SELECT c.name as country, c.continent, l.name as language, l.official
  -- 1. From countries (alias as c)
  FROM countries as c
  -- 2. Join to languages (as l)
  INNER JOIN languages as l
    -- 3. Match using code
    USING (code)

  
sql_stmt = "SELECT c.name as country, c.continent, l.name as language, l.official \
FROM countries.countries as c \
INNER JOIN countries.languages as l \
USING (code)"

pd.read_sql(sql_stmt, conn).head()

	country	continent	language	official
0	Afghanistan	Asia	Dari	True
1	Afghanistan	Asia	Pashto	True
2	Afghanistan	Asia	Turkic	False
3	Afghanistan	Asia	Other	False
4	Albania	Europe	Albanian	True

Self-ish joins, just in CASE

self-join on prime_ministers

  
sql_stmt = "SELECT * \
FROM leaders.prime_ministers"

pm_df = pd.read_sql(sql_stmt, conn)
pm_df.head()

	country	continent	prime_minister
0	Egypt	Africa	Sherif Ismail
1	Portugal	Europe	Antonio Costa
2	Vietnam	Asia	Nguyen Xuan Phuc
3	Haiti	North America	Jack Guy Lafontant
4	India	Asia	Narendra Modi

inner joins where a table is joined with itself
- self join
Explore how to slice a numerical field into categories using the CASE command
Self-joins are used to compare values in a field to other values of the same field from within the same table
Recall the prime ministers table:
- What if you wanted to create a new table showing countries that are in the same continenet matched as pairs?

  
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent;

  
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent"

pm_df_1 = pd.read_sql(sql_stmt, conn)
pm_df_1.head()

	country1	country2	continent
0	Egypt	Egypt	Africa
1	Portugal	Spain	Europe
2	Portugal	Norway	Europe
3	Portugal	Portugal	Europe
4	Vietnam	Oman	Asia

The country column is selected twice as well as continent.
The prime ministers table is on both the left and the right.
The vital step is setting the key columns by which we match the table to itself.
- For each country, there will be a match if the country in the “right table” (that’s also prime_ministers) is in the same continent.
This is a pairing of each country with every other country in its same continent
- Conditions where country1 = country2 should not be included in the table

Finishing off the self-join on prime_ministers

  
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent AND p1.country <> p2.country;

  
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent AND p1.country <> p2.country"

pm_df_2 = pd.read_sql(sql_stmt, conn)
pm_df_2.head()

	country1	country2	continent
0	Portugal	Spain	Europe
1	Portugal	Norway	Europe
2	Vietnam	Oman	Asia
3	Vietnam	Brunei	Asia
4	Vietnam	India	Asia

  
pm_df_1.equals(pm_df_2)

False

AND clause can check that multiple conditions are met.
Now a match will not be made between prime_minister and itself if the countries match

CASE WHEN and THEN

The states table contains numeric data about different countries in the six inhabited world continents
Group the year of independence into categories of:
- before 1900
- between 1900 and 1930
- and after 1930
CASE is a way to do multiple if-then-else statements

  
SELECT name, continent, indep_year,
    CASE WHEN indep_year < 1900 THEN 'before 1900'
    WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
    ELSE 'after 1930' END
    AS indep_year_group
FROM states
ORDER BY indep_year_group;

  
sql_stmt = "SELECT name, continent, indep_year, \
CASE WHEN indep_year < 1900 THEN 'before 1900' \
WHEN indep_year <= 1930 THEN 'between 1900 and 1930' \
ELSE 'after 1930' END \
AS indep_year_group \
FROM leaders.states \
ORDER BY indep_year_group"

pd.read_sql(sql_stmt, conn)

	name	continent	indep_year	indep_year_group
0	Brunei	Asia	1984	after 1930
1	India	Asia	1947	after 1930
2	Oman	Asia	1951	after 1930
3	Vietnam	Asia	1945	after 1930
4	Liberia	Africa	1847	before 1900
5	Chile	South America	1810	before 1900
6	Haiti	North America	1804	before 1900
7	Portugal	Europe	1143	before 1900
8	Spain	Europe	1492	before 1900
9	Uruguay	South America	1828	before 1900
10	Norway	Europe	1905	between 1900 and 1930
11	Australia	Oceania	1901	between 1900 and 1930
12	Egypt	Africa	1922	between 1900 and 1930

Exercises

Self-join

In this exercise, you’ll use the populations table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!

Since you’ll be joining the populations table to itself, you can alias populations as p1 and also populations as p2. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.

Instructions 1/3

Join populations with itself ON country_code.
Select the country_code from p1 and the size field from both p1 and p2. SQL won’t allow same-named fields, so alias p1.size as size2010 and p2.size as size2015.

  
-- 4. Select fields with aliases
SELECT p1.size as size2010,
    p1.country_code,
    p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
  -- 2. Join to itself (alias as p2)
    INNER JOIN countries.populations as p2
    -- 3. Match on country code
    ON p1.country_code = p2.country_code

  
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code"

pd.read_sql(sql_stmt, conn).head()

	size2010	country_code	size2015
0	101597.0	ABW	103889.0
1	101597.0	ABW	101597.0
2	103889.0	ABW	103889.0
3	103889.0	ABW	101597.0
4	27962208.0	AFG	32526562.0

Instructions 2/3

Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.

Extend the ON in your query to include only those records where the p1.year (2010) matches with p2.year - 5 (2015 - 5 = 2010). This will omit the three entries per country_code that you aren’t interested in.

  
-- 4. Select fields with aliases
SELECT p1.country_code,
       p1.size as size2010,
       p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
  -- 2. Join to itself (alias as p2)
    INNER JOIN countries.populations as p2
    -- 3. Match on country code
    ON p1.country_code = p2.country_code
      -- 4. and year (with calculation)
      AND p1.year = (p2.year - 5)

  
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"

pd.read_sql(sql_stmt, conn).head()

	size2010	country_code	size2015
0	101597.0	ABW	103889.0
1	27962208.0	AFG	32526562.0
2	21219954.0	AGO	25021974.0
3	2913021.0	ALB	2889167.0
4	84419.0	AND	70473.0

Instructions 3/3

As you just saw, you can also use SQL to calculate values like p2.year - 5 for you. With two fields like size2010 and size2015, you may want to determine the percentage increase from one field to the next:

With two numeric fields A and B, the percentage growth from A to B can be calculated as \(\frac{(B−A)}{A}∗100.0\).

Add a new field to SELECT, aliased as growth_perc, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size and p1.size.

  
SELECT p1.country_code,
       p1.size AS size2010, 
       p2.size AS size2015,
       -- 1. calculate growth_perc
       ((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM countries.populations as p1
  -- 3. Join to itself (alias as p2)
  INNER JOIN countries.populations as p2
    -- 4. Match on country code
    ON p1.country_code = p2.country_code
        -- 5. and year (with calculation)
        AND p1.year = (p2.year - 5);

  
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015, \
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"

pd.read_sql(sql_stmt, conn).head()

	size2010	country_code	size2015	growth_perc
0	101597.0	ABW	103889.0	2.255972
1	27962208.0	AFG	32526562.0	16.323297
2	21219954.0	AGO	25021974.0	17.917192
3	2913021.0	ALB	2889167.0	-0.818875
4	84419.0	AND	70473.0	-16.519977

Case when and then

Often it’s useful to look at a numerical field not as raw data, but instead as being in different categories or groups.

You can use CASE with WHEN, THEN, ELSE, and END to define a new grouping field.

Instructions

Using the countries table, create a new field AS geosize_group that groups the countries into three groups:

If surface_area is greater than 2 million, geosize_group is 'large'.
If surface_area is greater than 350 thousand but not larger than 2 million, geosize_group is 'medium'.
Otherwise, geosize_group is 'small'.

  
SELECT name, continent, code, surface_area,
    -- 1. First case
    CASE WHEN surface_area > 2000000 THEN 'large'
        -- 2. Second case
        WHEN surface_area > 350000 THEN 'medium'
        -- 3. Else clause + end
        ELSE 'small' END
        -- 4. Alias name
        AS geosize_group
-- 5. From table
FROM countries.countries;

  
sql_stmt = "SELECT name, continent, code, surface_area, \
CASE WHEN surface_area > 2000000 THEN 'large' \
WHEN surface_area > 350000 THEN 'medium' \
ELSE 'small' END \
AS geosize_group \
FROM countries.countries;"

pd.read_sql(sql_stmt, conn).head()

	name	continent	code	surface_area	geosize_group
0	Afghanistan	Asia	AFG	652090.0	medium
1	Netherlands	Europe	NLD	41526.0	small
2	Albania	Europe	ALB	28748.0	small
3	Algeria	Africa	DZA	2381740.0	large
4	American Samoa	Oceania	ASM	199.0	small

Inner challenge

The table you created with the added geosize_group field has been loaded for you here with the name countries_plus. Observe the use of (and the placement of) the INTO command to create this countries_plus table:

If you have downloaded the data from DataCamp and already have a schema for countries, countries_plus is already one of the tables

  
SELECT name, continent, code, surface_area,
    CASE WHEN surface_area > 2000000
            THEN 'large'
       WHEN surface_area > 350000
            THEN 'medium'
       ELSE 'small' END
       AS geosize_group
INTO countries_plus
FROM countries.countries;

You will now explore the relationship between the size of a country in terms of surface area and in terms of population using grouping fields created with CASE.

By the end of this exercise, you’ll be writing two queries back-to-back in a single script. You got this!

Instructions 1/3

Using the populations table focused only for the year 2015, create a new field AS popsize_group to organize population size into

'large' (> 50 million),
'medium' (> 1 million), and
'small' groups.

Select only the country code, population size, and this new popsize_group as fields.

  
SELECT country_code, size,
    -- 1. First case
    CASE WHEN size > 50000000 THEN 'large'
        -- 2. Second case
        WHEN size > 1000000 THEN 'medium'
        -- 3. Else clause + end
        ELSE 'small' END
        -- 4. Alias name
        AS popsize_group
-- 5. From table
FROM countries.populations
-- 6. Focus on 2015
WHERE year = 2015;

  
sql_stmt = "\
SELECT country_code, size, \
CASE WHEN size > 50000000 THEN 'large' \
WHEN size > 1000000 THEN 'medium' \
ELSE 'small' END \
AS popsize_group \
FROM countries.populations \
WHERE year = 2015; \
"

pd.read_sql(sql_stmt, conn).head()

	country_code	size	popsize_group
0	ABW	103889.0	small
1	AFG	32526562.0	medium
2	AGO	25021974.0	medium
3	ALB	2889167.0	medium
4	AND	70473.0	small

Instructions 2/3

Use INTO to save the result of the previous query as pop_plus. You can see an example of this in the countries_plus code in the assignment text. Make sure to include a ; at the end of your WHERE clause!
Then, include another query below your first query to display all the records in pop_plus using SELECT * FROM pop_plus; so that you generate results and this will display pop_plus in query result.

Execute the first part on the PostgreSQL schema to create pop_plus

  
SELECT country_code, size,
    CASE WHEN size > 50000000 THEN 'large'
        WHEN size > 1000000 THEN 'medium'
        ELSE 'small' END
        AS popsize_group
-- 1. Into table        
INTO countries.pop_plus
FROM populations
WHERE year = 2015;

Run this below

  
-- 2. Select all columns of pop_plus
SELECT * FROM countries.pop_plus;

  
sql_stmt = "\
SELECT country_code, size, \
CASE WHEN size > 50000000 THEN 'large' \
WHEN size > 1000000 THEN 'medium' \
ELSE 'small' END \
AS popsize_group \
INTO countries.pop_plus \
FROM countries.populations \
WHERE year = 2015;"

conn.execute(text(sql_stmt))

<sqlalchemy.engine.cursor.CursorResult at 0x27c993825f0>

  
sql_stmt = "\
SELECT * FROM countries.pop_plus; \
"

pd.read_sql(sql_stmt, conn).head()

	country_code	size	popsize_group
0	ABW	103889.0	small
1	AFG	32526562.0	medium
2	AGO	25021974.0	medium
3	ALB	2889167.0	medium
4	AND	70473.0	small

Instructions 3/3

Keep the first query intact that creates pop_plus using INTO.
Write a query to join countries_plus AS c on the left with pop_plus AS p on the right matching on the country code fields.
Sort the data based on geosize_group, in ascending order so that large appears on top.
Select the name, continent, geosize_group, and popsize_group fields.

  
sql_stmt = "\
SELECT c.name, c.continent, c.geosize_group, p.popsize_group \
FROM countries.countries_plus AS c \
INNER JOIN countries.pop_plus AS p \
ON c.code = p.country_code \
ORDER BY geosize_group ASC \
"

q_df = pd.read_sql(sql_stmt, conn)
q_df.head()

	name	continent	geosize_group	popsize_group
0	India	Asia	large	large
1	United States	North America	large	large
2	Saudi Arabia	Asia	large	medium
3	China	Asia	large	large
4	Kazakhstan	Asia	large	medium

  
q_df.tail()

	name	continent	geosize_group	popsize_group
201	Guam	Oceania	small	small
202	Guyana	South America	small	small
203	Hong Kong	Asia	small	medium
204	Honduras	North America	small	medium
205	Croatia	Europe	small	medium

Outer JOINs and Cross JOINs

In this chapter, you’ll come to grips with different kinds of outer joins. You’ll learn how to gain further insights into your data through left joins, right joins, and full joins. In addition to outer joins, you’ll also work with cross joins.

LEFT and RIGHT JOINs

You can remember outer joins as reaching out to another table while keeping all of the records of the original table.
Inner joins keep only the records in both tables.
This chapter will explore three types of OUTER JOINs:
1. LEFT JOINs
2. RIGHT JOINs
3. FULL JOINs
How a LEFT JOIN differs from an INNER JOIN:

INNER JOIN

The only records included in the resulting table of the INNER JOIN query were those in which the id field had matching values.

  
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
INNER JOIN presidents p2
ON p1.country = p2.country;

LEFT JOIN

In contrast, a LEFT JOIN notes those record in the left table that do not have a match on the key field in the right table.
This is denoted in the diagram by the open circles remaining close to the left table for id values of 2 and 3.
Whereas the INNER JOIN kept just the records corresponding to id values 1 and 44, a LEFT JOIN keeps all of the original records in the left table, but then marks the values as missing in the right table for those that don’t have a match.
The syntax of the LEFT JOIN is similar to that of the INNER JOIN.

  
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
LEFT JOIN presidents p2
ON p1.country = p2.country;

LEFT JOIN multiple matches

It isn’t always the case that each key value in the left table corresponds to exactly one record in the key column of the right table.
Duplicate rows are shown in the LEFT JOIN for id 1 since it has two matches corresponding to the values of R1 and R2 in the right2 table.

RIGHT JOIN

Instead of matching entries in the id column on the left table to the id column on the right table, a RIGHT JOIN does the reverse.
- The resulting table from a RIGHT JOIN shows the missing entries in the L_val field.
In the SQL statement the right table appears after RIGHT JOIN and the left table appears after FROM.

  
SELECT right_table.id AS R_id,
       left_table.val AS L_val,
       right_talbe.vale AS R_val
FROM left_table
RIGHT JOIN right_table
ON left_table.id = right_table.id;

INNER JOIN

  
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"

pd.read_sql(sql_stmt, conn)

	country	prime_minister	president
0	Egypt	Sherif Ismail	Abdel Fattah el-Sisi
1	Portugal	Antonio Costa	Marcelo Rebelo de Sousa
2	Vietnam	Nguyen Xuan Phuc	Tran Dai Quang
3	Haiti	Jack Guy Lafontant	Jovenel Moise

LEFT JOIN

The first four records are the same as those from INNER JOIN
The following records correspond to the countries that do not have a president and thus their president values are missing.

  
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
LEFT JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"

pd.read_sql(sql_stmt, conn)

	country	prime_minister	president
0	Egypt	Sherif Ismail	Abdel Fattah el-Sisi
1	Portugal	Antonio Costa	Marcelo Rebelo de Sousa
2	Vietnam	Nguyen Xuan Phuc	Tran Dai Quang
3	Haiti	Jack Guy Lafontant	Jovenel Moise
4	India	Narendra Modi	None
5	Australia	Malcolm Turnbull	None
6	Norway	Erna Solberg	None
7	Brunei	Hassanal Bolkiah	None
8	Oman	Qaboos bin Said al Said	None
9	Spain	Mariano Rajoy	None

Exercises

LEFT JOIN

Now you’ll explore the differences between performing an inner join and a left join using the cities and countries tables.

You’ll begin by performing an inner join with the cities table on the left and the countries table on the right. Remember to alias the name of the city field as city and the name of the country field as country.

You will then change the query to a left join. Take note of how many records are in each query here!

Instructions 1/2

Fill in the code based on the instructions in the code comments to complete the inner join. Note how many records are in the result of the join in the query result tab.

  
-- Select the city name (with alias), the country code,
-- the country name (with alias), the region,
-- and the city proper population
SELECT c1.name AS city, code, c2.name AS country,
       region, city_proper_pop
-- From left table (with alias)
FROM cities AS c1
  -- Join to right table (with alias)
  INNER JOIN countries AS c2
    -- Match on country code
    ON c1.country_code = c2.code
-- Order by descending country code
ORDER BY code DESC;

  
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
INNER JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"

pd.read_sql(sql_stmt, conn).head()

	city	code	country	region	city_proper_pop
0	Harare	ZWE	Zimbabwe	Eastern Africa	1606000.0
1	Lusaka	ZMB	Zambia	Eastern Africa	1742979.0
2	Cape Town	ZAF	South Africa	Southern Africa	3740026.0
3	Johannesburg	ZAF	South Africa	Southern Africa	4434827.0
4	Durban	ZAF	South Africa	Southern Africa	3442361.0

Instructions 2/2

Change the code to perform a LEFT JOIN instead of an INNER JOIN. After executing this query, note how many records the query result contains.

  
SELECT c1.name AS city, code, c2.name AS country,
       region, city_proper_pop
FROM cities AS c1
  -- 1. Join right table (with alias)
  LEFT JOIN countries AS c2
    -- 2. Match on country code
    ON c1.country_code = c2.code
-- 3. Order by descending country code
ORDER BY code DESC;

  
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
LEFT JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"

pd.read_sql(sql_stmt, conn).head()

	city	code	country	region	city_proper_pop
0	Taichung	None	None	None	2752413.0
1	Tainan	None	None	None	1885252.0
2	Kaohsiung	None	None	None	2778918.0
3	Bucharest	None	None	None	1883425.0
4	Taipei	None	None	None	2704974.0

JEFT JOIN (2)

Next, you’ll try out another example comparing an inner join to its corresponding left join. Before you begin though, take note of how many records are in both the countries and languages tables below.

You will begin with an inner join on the countries table on the left with the languages table on the right. Then you’ll change the code to a left join in the next bullet.

Note the use of multi-line comments here using /* and */.

Instructions 1/2

Perform an inner join. Alias the name of the country field as country and the name of the language field as language.
Sort based on desc

  
sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
INNER JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"

res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()

Number of Records: 914

	country	local_name	language	percent
0	Zimbabwe	Zimbabwe	Shona	NaN
1	Zimbabwe	Zimbabwe	Tonga	NaN
2	Zimbabwe	Zimbabwe	Tswana	NaN
3	Zimbabwe	Zimbabwe	Venda	NaN
4	Zimbabwe	Zimbabwe	Xhosa	NaN

Instructions 2/2

Perform a left join instead of an inner join. Observe the result, and also note the change in the number of records in the result.
Carefully review which records appear in the left join result, but not in the inner join result.

  
sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
LEFT JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"

res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()

Number of Records: 921

	country	local_name	language	percent
0	Zimbabwe	Zimbabwe	Chibarwe	NaN
1	Zimbabwe	Zimbabwe	Shona	NaN
2	Zimbabwe	Zimbabwe	Ndebele	NaN
3	Zimbabwe	Zimbabwe	English	NaN
4	Zimbabwe	Zimbabwe	Chewa	NaN

LEFT JOIN (3)

Left join (3) You’ll now revisit the use of the AVG() function introduced in our Intro to SQL for Data Science course. You will use it in combination with left join to determine the average gross domestic product (GDP) per capita by region in 2010.

Instructions 1/3

Begin with a left join with the countries table on the left and the economies table on the right.
Focus only on records with 2010 as the year.

  
sql_stmt = "\
SELECT name, region, gdp_percapita \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010; \
"

res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()

Number of Records: 185

	name	region	gdp_percapita
0	Afghanistan	Southern and Central Asia	539.667
1	Angola	Central Africa	3599.270
2	Albania	Southern Europe	4098.130
3	United Arab Emirates	Middle East	34628.630
4	Argentina	South America	10412.950

Instructions 2/3

Modify your code to calculate the average GDP per capita AS avg_gdp for each region in 2010.
Select the region and avg_gdp fields.

  
sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region \
ORDER BY avg_gdp DESC; \
"

res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()

Number of Records: 23

	region	avg_gdp
0	Western Europe	58130.961496
1	Nordic Countries	57073.997656
2	North America	47911.509766
3	Australia and New Zealand	44792.384766
4	British Islands	43588.330078

Instructions 3/3

Arrange this data on average GDP per capita for each region in 2010 from highest to lowest average GDP per capita.

  
sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region; \
"

res3 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res3)}')
res3.head()

Number of Records: 23

	region	avg_gdp
0	Southern Africa	5051.597974
1	Caribbean	11413.339454
2	Eastern Africa	1757.348162
3	Southern Europe	22926.410911
4	Eastern Asia	26205.851400

RIGHT JOIN

Right joins aren’t as common as left joins. One reason why is that you can always write a right join as a left join.

Instructions

The left join code is commented out here. Your task is to write a new query using rights joins that produces the same result as what the query using left joins produces. Keep this left joins code commented as you write your own query just below it using right joins to solve the problem.
Note the order of the joins matters in your conversion to using right joins!
convert this code to use RIGHT JOINs instead of LEFT JOINs

  
SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
       indep_year, languages.name AS language, percent
FROM cities
  LEFT JOIN countries
    ON cities.country_code = countries.code
  LEFT JOIN languages
    ON countries.code = languages.code
ORDER BY city, language;

  
sql_stmt = "\
SELECT cities.name AS city, urbanarea_pop, countries.name AS country, \
indep_year, languages.name AS language, percent \
FROM countries.languages \
RIGHT JOIN countries.countries \
ON languages.code = countries.code \
RIGHT JOIN countries.cities \
ON cities.country_code = countries.code \
ORDER BY city, language; \
"

pd.read_sql(sql_stmt, conn).head()

	city	urbanarea_pop	country	indep_year	language	percent
0	Abidjan	4765000.0	Cote d'Ivoire	1960.0	French	NaN
1	Abidjan	4765000.0	Cote d'Ivoire	1960.0	Other	NaN
2	Abu Dhabi	1145000.0	United Arab Emirates	1971.0	Arabic	NaN
3	Abu Dhabi	1145000.0	United Arab Emirates	1971.0	English	NaN
4	Abu Dhabi	1145000.0	United Arab Emirates	1971.0	Hindi	NaN

FULL JOINs

The last of the three types of OUTER JOINs is the FULL JOIN
Explore the difference between FULL JOIN and other JOINs
The instruction will focus on comparing them to INNER JOINs and LEFT JOINs and then to LEFT JOINs and RIGHT JOINs.
Let’s review how the diagram changes between and INNER JOIN and a LEFT JOIN for our basic example using the left and right tables.
Then we’ll delve into the FULL JOIN diagram and is SQL code.
Recall that an INNER JOIN keeps only the records that have matching key field values in both tables.
A LEFT JOIN keeps all of the records in the left table while bringing in missing values for those key field values that don’t appear in the right table.
Let’s review the differences between a LEFT JOIN and a RIGHT JOIN.
The id values of 2 and 3 in the left table do not match with the id values in the right table, so missing values are brought in for them in the LEFT JOIN.
Likewise for the RIGHT JOIN, missing values are brought in for id values of 5 and 6.
A FULL JOIN combines a LEFT JOIN and RIGHT JOIN as you can see in the diagram.
It will bring in all record from both the left and the right table and keep track of the missing values accordingly.
Note the missing values here and all six of the values of id are included in the table.
You can also see from the SQL code, to produce this FULL JOIN result, the general format aligns closely with the SQL syntax seen for an INNER JOIN and a LEFT JOIN.

  
SELECT left_table.id AS L_id,
       right_table.id AS R_id,
       left_table.val AS L_val,
       right_table.val as R_val,
FULL left_table
FULL JOIN right_table
USING (id);

FULL JOIN example using leaders database

Let’s revisit the example of looking at countries with prime ministers and / or presidents.
Query breakdown:
The SELECT statement includes the country field from both tables of interest and also the prime_minister and president fields.
The left table is specified as prime_ministers with the alias of p1
The order matters and if you switched the two tables, the output would be slightly different.
The right table is specified as presidents with the alias of p2
The join is done based on the key field of country in both tables

  
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president
FROM prime_ministers AS p1
FULL JOIN presidents AS p2
ON p1.country = p2.country;

  
sql_stmt = "\
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president \
FROM leaders.prime_ministers AS p1 \
FULL JOIN leaders.presidents AS p2 \
ON p1.country = p2.country; \
"

pd.read_sql(sql_stmt, conn)

	pm_co	pres_co	prime_minister	president
0	Egypt	Egypt	Sherif Ismail	Abdel Fattah el-Sisi
1	Portugal	Portugal	Antonio Costa	Marcelo Rebelo de Sousa
2	Vietnam	Vietnam	Nguyen Xuan Phuc	Tran Dai Quang
3	Haiti	Haiti	Jack Guy Lafontant	Jovenel Moise
4	India	None	Narendra Modi	None
5	Australia	None	Malcolm Turnbull	None
6	Norway	None	Erna Solberg	None
7	Brunei	None	Hassanal Bolkiah	None
8	Oman	None	Qaboos bin Said al Said	None
9	Spain	None	Mariano Rajoy	None
10	None	Uruguay	None	Jose Mujica
11	None	Chile	None	Michelle Bachelet
12	None	Liberia	None	Ellen Johnson Sirleaf

Exercises

FULL JOIN

In this exercise, you’ll examine how your results differ when using a full join versus using a left join versus using an inner join with the countries and currencies tables.

You will focus on the North American region and also where the name of the country is missing. Dig in to see what we mean!

Begin with a full join with countries on the left and currencies on the right. The fields of interest have been SELECTed for you throughout this exercise.

Then complete a similar left join and conclude with an inner join.

Instructions 1/3

Choose records in which region corresponds to North America or is NULL.

  
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
  -- 4. Join to currencies
  FULL JOIN currencies
    -- 5. Match on code
    USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;

  
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
  FULL JOIN countries.currencies \
    USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"

pd.read_sql(sql_stmt, conn)

	country	code	region	basic_unit
0	Canada	CAN	North America	Canadian dollar
1	United States	USA	North America	United States dollar
2	Bermuda	BMU	North America	Bermudian dollar
3	Greenland	GRL	North America	None
4	None	TMP	None	United States dollar
5	None	FLK	None	Falkland Islands pound
6	None	AIA	None	East Caribbean dollar
7	None	NIU	None	New Zealand dollar
8	None	ROM	None	Romanian leu
9	None	SHN	None	Saint Helena pound
10	None	SGS	None	British pound
11	None	TWN	None	New Taiwan dollar
12	None	WLF	None	CFP franc
13	None	MSR	None	East Caribbean dollar
14	None	IOT	None	United States dollar
15	None	CCK	None	Australian dollar
16	None	COK	None	New Zealand dollar

Instructions 2/3

Repeat the same query as above but use a LEFT JOIN instead of a FULL JOIN. Note what has changed compared to the FULL JOIN result!

  
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
  -- 4. Join to currencies
  LEFT JOIN currencies
    -- 5. Match on code
    USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;

  
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
  LEFT JOIN countries.currencies \
    USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"

pd.read_sql(sql_stmt, conn)

	country	code	region	basic_unit
0	Bermuda	BMU	North America	Bermudian dollar
1	Canada	CAN	North America	Canadian dollar
2	United States	USA	North America	United States dollar
3	Greenland	GRL	North America	None

Instruction 3/3

Repeat the same query as above but use an INNER JOIN instead of a FULL JOIN. Note what has changed compared to the FULL JOIN and LEFT JOIN results!

  
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
  -- 4. Join to currencies
  INNER JOIN currencies
    -- 5. Match on code
    USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;

  
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
  INNER JOIN countries.currencies \
    USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"

pd.read_sql(sql_stmt, conn)

	country	code	region	basic_unit
0	Bermuda	BMU	North America	Bermudian dollar
1	Canada	CAN	North America	Canadian dollar
2	United States	USA	North America	United States dollar

Have you kept an eye out on the different numbers of records these queries returned? The FULL JOIN query returned 17 rows, the LEFT JOIN returned 4 rows, and the INNER JOIN only returned 3 rows. Do these results make sense to you?

FULL JOIN (2)

You’ll now investigate a similar exercise to the last one, but this time focused on using a table with more records on the left than the right. You’ll work with the languages and countries tables.

Begin with a full join with languages on the left and countries on the right. Appropriate fields have been selected for you again here.

Instructions 1/3

Choose records in which countries.name starts with the capital letter 'V' or is NULL and arrange by countries.name in ascending order to more clearly see the results.

  
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
  -- 4. Join to countries
  FULL JOIN countries
    -- 5. Match on code
    USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;

  
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
  FULL JOIN countries.countries \
    USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"

pd.read_sql(sql_stmt, conn)

	name	code	language
0	Vanuatu	VUT	Tribal Languages
1	Vanuatu	VUT	English
2	Vanuatu	VUT	French
3	Vanuatu	VUT	Other
4	Vanuatu	VUT	Bislama
5	Venezuela	VEN	Spanish
6	Venezuela	VEN	indigenous
7	Vietnam	VNM	Vietnamese
8	Vietnam	VNM	English
9	Vietnam	VNM	Other
10	Virgin Islands, British	VGB	None
11	Virgin Islands, U.S.	VIR	None
12	None	NIU	English
13	None	NIU	Niuean
14	None	NIU	Other
15	None	NFK	English
16	None	NFK	Other
17	None	ROM	Romanian
18	None	ROM	Hungarian
19	None	ROM	Romani
20	None	ROM	Other
21	None	ROM	unspecified
22	None	SPM	French
23	None	TWN	Mandarin
24	None	TWN	Taiwanese
25	None	TWN	Hakka
26	None	TKL	Tokelauan
27	None	TKL	English
28	None	TKL	Samoan
29	None	TKL	Tuvaluan
30	None	TKL	Kiribati
31	None	TKL	Other
32	None	AIA	English
33	None	TKL	unspecified
34	None	WLF	Wallisian
35	None	WLF	Futunian
36	None	WLF	French
37	None	WLF	Other
38	None	ESH	Standard
39	None	ESH	Hassaniya
40	None	ESH	Moroccan
41	None	TKL	none
42	None	CXR	English
43	None	CXR	Chinese
44	None	CXR	Malay
45	None	CCK	Malay
46	None	CCK	English
47	None	COK	English
48	None	COK	Rarotongan
49	None	COK	Other
50	None	MSR	English
51	None	NIU	Niuean
52	None	NIU	Niuean

Instructions 2/3

Repeat the same query as above but use a left join instead of a full join. Note what has changed compared to the full join result!

  
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
  -- 4. Join to countries
  LEFT JOIN countries
    -- 5. Match on code
    USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;

  
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
  LEFT JOIN countries.countries \
    USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"

pd.read_sql(sql_stmt, conn)

	name	code	language
0	Vanuatu	VUT	English
1	Vanuatu	VUT	Other
2	Vanuatu	VUT	French
3	Vanuatu	VUT	Tribal Languages
4	Vanuatu	VUT	Bislama
5	Venezuela	VEN	indigenous
6	Venezuela	VEN	Spanish
7	Vietnam	VNM	English
8	Vietnam	VNM	Vietnamese
9	Vietnam	VNM	Other
10	None	NIU	Niuean
11	None	NIU	Niuean
12	None	NIU	English
13	None	NIU	Niuean
14	None	NIU	Other
15	None	NFK	English
16	None	NFK	Other
17	None	ROM	Romanian
18	None	ROM	Hungarian
19	None	ROM	Romani
20	None	ROM	Other
21	None	ROM	unspecified
22	None	SPM	French
23	None	TWN	Mandarin
24	None	TWN	Taiwanese
25	None	TWN	Hakka
26	None	TKL	Tokelauan
27	None	TKL	English
28	None	TKL	Samoan
29	None	TKL	Tuvaluan
30	None	TKL	Kiribati
31	None	TKL	Other
32	None	TKL	none
33	None	TKL	unspecified
34	None	WLF	Wallisian
35	None	WLF	Futunian
36	None	WLF	French
37	None	WLF	Other
38	None	ESH	Standard
39	None	ESH	Hassaniya
40	None	AIA	English
41	None	ESH	Moroccan
42	None	CXR	English
43	None	CXR	Chinese
44	None	CXR	Malay
45	None	CCK	Malay
46	None	CCK	English
47	None	COK	English
48	None	COK	Rarotongan
49	None	COK	Other
50	None	MSR	English

Instructions 3/3

Repeat once more, but use an inner join instead of a left join. Note what has changed compared to the full join and left join results.

  
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
  -- 4. Join to countries
  INNER JOIN countries
    -- 5. Match on code
    USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;

  
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
  INNER JOIN countries.countries \
    USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"

pd.read_sql(sql_stmt, conn)

	name	code	language
0	Vanuatu	VUT	Tribal Languages
1	Vanuatu	VUT	Bislama
2	Vanuatu	VUT	English
3	Vanuatu	VUT	French
4	Vanuatu	VUT	Other
5	Venezuela	VEN	Spanish
6	Venezuela	VEN	indigenous
7	Vietnam	VNM	Vietnamese
8	Vietnam	VNM	English
9	Vietnam	VNM	Other

FULL JOIN (3)

You’ll now explore using two consecutive full joins on the three tables you worked with in the previous two exercises.

Instructions

Complete a full join with countries on the left and languages on the right.
Next, full join this result with currencies on the right.
Use LIKE to choose the Melanesia and Micronesia regions (Hint: 'M%esia').
Select the fields corresponding to the country name AS country, region, language name AS language, and basic and fractional units of currency.

  
-- 7. Select fields (with aliases)
SELECT c1.name AS country, region, l.name AS language,
       basic_unit, frac_unit
-- 1. From countries (alias as c1)
FROM countries AS c1
  -- 2. Join with languages (alias as l)
  FULL JOIN languages AS l
    -- 3. Match on code
    USING (code)
  -- 4. Join with currencies (alias as c2)
  FULL JOIN currencies AS c2
    -- 5. Match on code
    USING (code)
-- 6. Where region like Melanesia and Micronesia
WHERE region LIKE 'M%esia';

  
sql_stmt = "\
SELECT c1.name AS country, region, l.name AS language, \
       basic_unit, frac_unit \
FROM countries.countries AS c1 \
  FULL JOIN countries.languages AS l \
    USING (code) \
  FULL JOIN countries.currencies AS c2 \
    USING (code) \
WHERE region LIKE 'M%%esia'; \
"

pd.read_sql(sql_stmt, conn)

	country	region	language	basic_unit	frac_unit
0	Kiribati	Micronesia	English	Australian dollar	Cent
1	Kiribati	Micronesia	Kiribati	Australian dollar	Cent
2	Marshall Islands	Micronesia	Other	United States dollar	Cent
3	Marshall Islands	Micronesia	Marshallese	United States dollar	Cent
4	Nauru	Micronesia	Other	Australian dollar	Cent
5	Nauru	Micronesia	English	Australian dollar	Cent
6	Nauru	Micronesia	Nauruan	Australian dollar	Cent
7	New Caledonia	Melanesia	Other	CFP franc	Centime
8	New Caledonia	Melanesia	French	CFP franc	Centime
9	Palau	Micronesia	Other	United States dollar	Cent
10	Palau	Micronesia	Chinese	United States dollar	Cent
11	Palau	Micronesia	Filipino	United States dollar	Cent
12	Palau	Micronesia	English	United States dollar	Cent
13	Palau	Micronesia	Other	United States dollar	Cent
14	Palau	Micronesia	Palauan	United States dollar	Cent
15	Papua New Guinea	Melanesia	Other	Papua New Guinean kina	Toea
16	Papua New Guinea	Melanesia	Hiri	Papua New Guinean kina	Toea
17	Papua New Guinea	Melanesia	English	Papua New Guinean kina	Toea
18	Papua New Guinea	Melanesia	Tok Pisin	Papua New Guinean kina	Toea
19	Solomon Islands	Melanesia	indigenous	Solomon Islands dollar	Cent
20	Solomon Islands	Melanesia	English	Solomon Islands dollar	Cent
21	Solomon Islands	Melanesia	Melanesian pidgin	Solomon Islands dollar	Cent
22	Vanuatu	Melanesia	Other	Vanuatu vatu	None
23	Vanuatu	Melanesia	French	Vanuatu vatu	None
24	Vanuatu	Melanesia	English	Vanuatu vatu	None
25	Vanuatu	Melanesia	Bislama	Vanuatu vatu	None
26	Vanuatu	Melanesia	Tribal Languages	Vanuatu vatu	None
27	Micronesia, Federated States of	Micronesia	Kapingamarangi	None	None
28	Micronesia, Federated States of	Micronesia	Nukuoro	None	None
29	Micronesia, Federated States of	Micronesia	Woleaian	None	None
30	Micronesia, Federated States of	Micronesia	Ulithian	None	None
31	Micronesia, Federated States of	Micronesia	Yapese	None	None
32	Micronesia, Federated States of	Micronesia	Pohnpeian	None	None
33	Micronesia, Federated States of	Micronesia	Kosrean	None	None
34	Micronesia, Federated States of	Micronesia	Chuukese	None	None
35	Micronesia, Federated States of	Micronesia	English	None	None
36	Fiji Islands	Melanesia	None	None	None
37	Northern Mariana Islands	Micronesia	Other	None	None
38	Northern Mariana Islands	Micronesia	Other Asian	None	None
39	Northern Mariana Islands	Micronesia	Chinese	None	None
40	Northern Mariana Islands	Micronesia	Other Pacific Island	None	None
41	Northern Mariana Islands	Micronesia	English	None	None
42	Northern Mariana Islands	Micronesia	Chamorro	None	None
43	Northern Mariana Islands	Micronesia	Philippine	None	None
44	Guam	Micronesia	Other	None	None
45	Guam	Micronesia	Asian	None	None
46	Guam	Micronesia	Other Pacific Islander	None	None
47	Guam	Micronesia	Chamorro	None	None
48	Guam	Micronesia	Filipino	None	None
49	Guam	Micronesia	English	None	None

Review OUTER JOINs

A(n) ___ join is a join combining the results of a ___ join and a ___ join.

Answer the question

~~left, full, right~~
~~right, full, left~~
~~inner, left, right~~
**None of the above are true**

CROSSing the rubicon

CROSS JOINs create all possible combinations of two tables.
The resulting table is comprised of all 9 combinations if id from table1 and id from table2 (e.g. 1(A-C), 2(A-C), & 3(A-C))

CROSS JOIN example: Pairing prime ministers with presidents

Suppose all prime ministers in North America and Oceania in the prime_ministers table are scheduled for individual meetings with all presidents in the presidents table.
All the combinations can be created with a CROSS JOIN

  
SELECT prime_minister, president
FROM prime_ministers AS p1
CROSS JOIN presidents AS p2
WHERE p1.continent IN ('North America', 'Oceania');

  
sql_stmt = "\
SELECT prime_minister, president \
FROM leaders.prime_ministers AS p1 \
CROSS JOIN leaders.presidents AS p2 \
WHERE p1.continent IN ('North America', 'Oceania'); \
"

pd.read_sql(sql_stmt, conn)

	prime_minister	president
0	Jack Guy Lafontant	Abdel Fattah el-Sisi
1	Malcolm Turnbull	Abdel Fattah el-Sisi
2	Jack Guy Lafontant	Marcelo Rebelo de Sousa
3	Malcolm Turnbull	Marcelo Rebelo de Sousa
4	Jack Guy Lafontant	Jovenel Moise
5	Malcolm Turnbull	Jovenel Moise
6	Jack Guy Lafontant	Jose Mujica
7	Malcolm Turnbull	Jose Mujica
8	Jack Guy Lafontant	Ellen Johnson Sirleaf
9	Malcolm Turnbull	Ellen Johnson Sirleaf
10	Jack Guy Lafontant	Michelle Bachelet
11	Malcolm Turnbull	Michelle Bachelet
12	Jack Guy Lafontant	Tran Dai Quang
13	Malcolm Turnbull	Tran Dai Quang

Exercises

A table of two cities

This exercise looks to explore languages potentially and most frequently spoken in the cities of Hyderabad, India and Hyderabad, Pakistan.

You will begin with a cross join with cities AS c on the left and languages AS l on the right. Then you will modify the query using an inner join in the next tab.

Instructions 1/2

Create the cross join as described above. (Recall that cross joins do not use ON or USING.)
Make use of LIKE and Hyder% to choose Hyderabad in both countries.
Select only the city name AS city and language name AS language.

  
-- 4. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities AS c        
  -- 2. Join to languages (alias as l)
  CROSS JOIN languages AS l
-- 3. Where c.name like Hyderabad
WHERE c.name LIKE 'Hyder%';

  
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
  CROSS JOIN countries.languages AS l \
WHERE c.name LIKE 'Hyder%%'; \
"

hyderabad_lang = pd.read_sql(sql_stmt, conn)
hyderabad_lang

	city	language
0	Hyderabad (India)	Dari
1	Hyderabad	Dari
2	Hyderabad (India)	Pashto
3	Hyderabad	Pashto
4	Hyderabad (India)	Turkic
...	...	...
1905	Hyderabad	Tswana
1906	Hyderabad (India)	Venda
1907	Hyderabad	Venda
1908	Hyderabad (India)	Xhosa
1909	Hyderabad	Xhosa

1910 rows × 2 columns

  
unique_lang = hyderabad_lang['language'].unique()
print(len(unique_lang))

396

Instructions 2/2

Use an inner join instead of a cross join. Think about what the difference will be in the results for this inner join result and the one for the cross join.

  
-- 5. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities as c      
  -- 2. Join to languages (alias as l)
  INNER JOIN languages AS l
    -- 3. Match on country code
    ON c.country_code = l.code
-- 4. Where c.name like Hyderabad
WHERE c.name like 'Hyder%';

  
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
  INNER JOIN countries.languages AS l \
    ON c.country_code = l.code \
WHERE c.name LIKE 'Hyder%%'; \
"

pd.read_sql(sql_stmt, conn)

	city	language
0	Hyderabad (India)	Hindi
1	Hyderabad (India)	Bengali
2	Hyderabad (India)	Telugu
3	Hyderabad (India)	Marathi
4	Hyderabad (India)	Tamil
5	Hyderabad (India)	Urdu
6	Hyderabad (India)	Gujarati
7	Hyderabad (India)	Kannada
8	Hyderabad (India)	Malayalam
9	Hyderabad (India)	Oriya
10	Hyderabad (India)	Punjabi
11	Hyderabad (India)	Assamese
12	Hyderabad (India)	Maithili
13	Hyderabad (India)	Other
14	Hyderabad	Punjabi
15	Hyderabad	Sindhi
16	Hyderabad	Saraiki
17	Hyderabad	Pashto
18	Hyderabad	Urdu
19	Hyderabad	Balochi
20	Hyderabad	Hindko
21	Hyderabad	Brahui
22	Hyderabad	English
23	Hyderabad	Burushaski
24	Hyderabad	Other

Outer challenge

Now that you’re fully equipped to use outer joins, try a challenge problem to test your knowledge!

In terms of life expectancy for 2010, determine the names of the lowest five countries and their regions.

Instructions

Select country name AS country, region, and life expectancy AS life_exp.
Make sure to use LEFT JOIN, WHERE, ORDER BY, and LIMIT.

  
-- Select fields
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp
-- From countries (alias as c)
FROM countries AS c
  -- Join to populations (alias as p)
  LEFT JOIN populations AS p
    -- Match on country code
    ON c.code = p.country_code
-- Focus on 2010
WHERE p.year = 2010
-- Order by life_exp
ORDER BY life_exp
-- Limit to 5 records
LIMIT 5;

  
sql_stmt = "\
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp \
FROM countries.countries AS c \
  LEFT JOIN countries.populations AS p \
    ON c.code = p.country_code \
WHERE p.year = 2010 \
ORDER BY life_exp \
LIMIT 5; \
"

pd.read_sql(sql_stmt, conn)

	country	region	life_exp
0	Lesotho	Southern Africa	47.483414
1	Central African Republic	Central Africa	47.625317
2	Sierra Leone	Western Africa	48.228950
3	Swaziland	Southern Africa	48.345757
4	Zimbabwe	Eastern Africa	49.574657

Set theory clauses

In this chapter, you’ll learn more about set theory using Venn diagrams and you will be introduced to union, union all, intersect, and except clauses. You’ll finish by investigating semi-joins and anti-joins, which provide a nice introduction to subqueries.

State of the UNION

Focus on the operations UNION and UNION ALL.
In addition to joining diagrams, you’ll see how Venn diagrams can be used to represent set operations.
Think of each circle as representing a table of data
The shading represents what’s included in the result of the set operation from each table.
UNION includes every record in both tables, but DOES NOT double count those that are in both tables.
UNION ALL includes every record in both tables and DOES replicate those that are in both tables, represented by the black center
The two diagrams on the bottom represent only the subsets of data being selected.
INTERSECT results in only those records found in both of the tables.
EXCEPT results in only those records in one table, BUT NOT the other.
Given two tables with name left_one and right_one, one corresponds to each table having one field.
If you run a UNION on these two fields, the result is each record appearing in either table, but notice the id values of 1 and 4 in right_one, are not included again in the UNION since they were already found in the left_one table.
UNION ALL includes all duplicates in its result, resulting in 8 total records for the example.

UNION & UNION ALL example

monarchs table in the leaders database
Use UNION on the prime_ministers and monarchs tables
all prime ministers and monarchs

  
SELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;

Note that the prime_minister field has been aliased as leader. The resulting field from the UNION will have the name leader.
This is an important property of set theory clauses
The fields included in the operation must be of the same data type since they are returned as a single field.
A number field can’t be stacked on top of a character field.
Spain and Norway have a prime minister and a monarch, while Brunei and Oman have a monarch who also acts as a prime minister.

  
SELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION ALL
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;

UNION and UNION ALL clauses do not do the lookup step that JOINs do, they stack records on top of each other from one table to the next.

  
sql_stmt = "\
SELECT * \
FROM leaders.monarchs; \
"

pd.read_sql(sql_stmt, conn)

	country	continent	monarch
0	Brunei	Asia	Hassanal Bolkiah
1	Oman	Asia	Qaboos bin Said al Said
2	Norway	Europe	Harald V
3	Spain	Europe	Felipe VI

  
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"

pd.read_sql(sql_stmt, conn)

	leader	country
0	Malcolm Turnbull	Australia
1	Hassanal Bolkiah	Brunei
2	Sherif Ismail	Egypt
3	Jack Guy Lafontant	Haiti
4	Narendra Modi	India
5	Erna Solberg	Norway
6	Harald V	Norway
7	Qaboos bin Said al Said	Oman
8	Antonio Costa	Portugal
9	Mariano Rajoy	Spain
10	Felipe VI	Spain
11	Nguyen Xuan Phuc	Vietnam

  
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION ALL \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"

pd.read_sql(sql_stmt, conn)

	leader	country
0	Malcolm Turnbull	Australia
1	Hassanal Bolkiah	Brunei
2	Hassanal Bolkiah	Brunei
3	Sherif Ismail	Egypt
4	Jack Guy Lafontant	Haiti
5	Narendra Modi	India
6	Erna Solberg	Norway
7	Harald V	Norway
8	Qaboos bin Said al Said	Oman
9	Qaboos bin Said al Said	Oman
10	Antonio Costa	Portugal
11	Felipe VI	Spain
12	Mariano Rajoy	Spain
13	Nguyen Xuan Phuc	Vietnam

Exercises

UNION

Near query result to the right, you will see two new tables with names economies2010 and economies2015.

Instructions

Combine these two tables into one table containing all of the fields in economies2010. The economies table is also included for reference.
Sort this resulting single table by country code and then by year, both in ascending order.

  
-- Select fields from 2010 table
SELECT *
  -- From 2010 table
  FROM countries.economies2010
    -- Set theory clause
    UNION
-- Select fields from 2015 table
SELECT *
  -- From 2015 table
  FROM countries.economies2015
-- Order by code and year
ORDER BY code, year;

  
sql_stmt = "\
SELECT * \
  FROM countries.economies2010 \
    UNION \
SELECT * \
  FROM countries.economies2015 \
ORDER BY code, year; \
"

pd.read_sql(sql_stmt, conn)

	code	year	income_group	gross_savings
0	AFG	2010	Low income	37.133
1	AFG	2015	Low income	21.466
2	AGO	2010	Upper middle income	23.534
3	AGO	2015	Upper middle income	-0.425
4	ALB	2010	Upper middle income	20.011
...	...	...	...	...
375	ZAF	2015	Upper middle income	16.460
376	ZMB	2010	Lower middle income	37.404
377	ZMB	2015	Lower middle income	39.177
378	ZWE	2010	Low income	16.109
379	ZWE	2015	Low income	5.563

380 rows × 4 columns

UNION (2)

UNION can also be used to determine all occurrences of a field across multiple tables. Try out this exercise with no starter code.

Instructions

Determine all (non-duplicated) country codes in either the cities or the currencies table. The result should be a table with only one field called country_code.
Sort by country_code in alphabetical order.

  
-- Select field
SELECT country_code
  -- From cities
  FROM countries.cities
    -- Set theory clause
    UNION
-- Select field
SELECT code
  -- From currencies
  FROM countries.currencies
-- Order by country_code
ORDER BY country_code;

  
sql_stmt = "\
SELECT country_code \
  FROM countries.cities \
    UNION \
SELECT code \
  FROM countries.currencies \
ORDER BY country_code; \
"

country_codes = pd.read_sql(sql_stmt, conn)
country_codes.head()

	country_code
0	ABW
1	AFG
2	AGO
3	AIA
4	ALB

  
country_codes.tail()

	country_code
200	WSM
201	YEM
202	ZAF
203	ZMB
204	ZWE

UNION ALL

As you saw, duplicates were removed from the previous two exercises by using UNION.

To include duplicates, you can use UNION ALL.

Instructions

Determine all combinations (include duplicates) of country code and year that exist in either the economies or the populations tables. Order by code then year.
The result of the query should only have two columns/fields. Think about how many records this query should result in.
You’ll use code very similar to this in your next exercise after the video. Make note of this code after completing it.

  
-- Select fields
SELECT code, year
  -- From economies
  FROM countries.economies
    -- Set theory clause
    UNION ALL
-- Select fields
SELECT country_code, year
  -- From populations
  FROM countries.populations
-- Order by code, year
ORDER BY code, year;

  
sql_stmt = "\
SELECT code, year \
  FROM countries.economies \
    UNION ALL \
SELECT country_code, year \
  FROM countries.populations \
ORDER BY code, year; \
"

country_codes_year = pd.read_sql(sql_stmt, conn)
country_codes_year.head()

	code	year
0	ABW	2010
1	ABW	2015
2	AFG	2010
3	AFG	2010
4	AFG	2015

  
country_codes_year.tail()

	code	year
809	ZMB	2015
810	ZWE	2010
811	ZWE	2010
812	ZWE	2015
813	ZWE	2015

INTERSECTional data science

The set theory clause INTERSECT works in a similar fashion to UNION and UNION ALL, but remember from the Venn diagram, INTERSECT only includes those records in common to both tables and fields selected.

  
SELECT id
FROM left_one
INTERSECT
SELECT id
FROM right_one;

The result only includes records common to the tables selected
Determine countries with both a prime minister and president
The code for each of these set operations has a similar layout.
First select which fields to include from the first table, and then specify the name of the first table.
Specify the set operation to perform
Lastly, denote which fields to include from the second table, and then the name of the second table.

  
SELECT country
FROM leaders.prime_ministers
INTERSECT
SELECT country
FROM leaders.presidents

What happens if two columns are selected, instead of one?

  
SELECT country, prime_minister as leader
FROM leaders.prime_ministers
INTERSECT
SELECT country, president
FROM leaders.presidents

Will this also give you the names of the countries with both type of leaders?
This results in an empty table.
When INTERSECT looks at two columns, it includes both columns in the search.
It didn’t find any countries with prime ministers AND presidents having the same name.
INTERSECT looks for records in common, not individual key fields like what a join does to match.

  
sql_stmt = "\
SELECT country \
FROM leaders.prime_ministers \
INTERSECT \
SELECT country \
FROM leaders.presidents \
"

pd.read_sql(sql_stmt, conn)

	country
0	Portugal
1	Egypt
2	Haiti
3	Vietnam

Exercises

INTERSECT

Repeat the previous UNION ALL exercise, this time looking at the records in common for country code and year for the economies and populations tables.

Instructions

Again, order by code and then by year, both in ascending order.
Note the number of records here (given at the bottom of query result) compared to the similar UNION ALL query result (814 records).

  
-- Select fields
SELECT code, year
  -- From economies
  FROM countries.economies
    -- Set theory clause
    INTERSECT
-- Select fields
SELECT country_code, year
  -- From populations
  FROM countries.populations
-- Order by code and year
ORDER BY code, year;

  
sql_stmt = "\
SELECT code, year \
  FROM countries.economies \
    INTERSECT \
SELECT country_code, year \
  FROM countries.populations \
ORDER BY code, year; \
"

pd.read_sql(sql_stmt, conn)

	code	year
0	AFG	2010
1	AFG	2015
2	AGO	2010
3	AGO	2015
4	ALB	2010
...	...	...
375	ZAF	2015
376	ZMB	2010
377	ZMB	2015
378	ZWE	2010
379	ZWE	2015

380 rows × 2 columns

INTERSECT (2)

As you think about major world cities and their corresponding country, you may ask which countries also have a city with the same name as their country name?

Instructions

Use INTERSECT to answer this question with countries and cities!

  
-- Select fields
SELECT name
  -- From countries
  FROM countries.countries
    -- Set theory clause
    INTERSECT
-- Select fields
SELECT name
  -- From cities
  FROM countries.cities;

  
sql_stmt = "\
SELECT name \
  FROM countries.countries \
    INTERSECT \
SELECT name \
  FROM countries.cities; \
"

pd.read_sql(sql_stmt, conn)

	name
0	Singapore
1	Hong Kong

Hong Kong is part of China, but it appears separately here because it has its own ISO country code. Depending upon your analysis, treating Hong Kong separately could be useful or a mistake. Always check your dataset closely before you perform an analysis!

Review UNION and INTERSECT

Which of the following combinations of terms and definitions is correct?

Answer the question

~~UNION: returns all records (potentially duplicates) in both tables~~
~~UNION ALL: returns only unique records~~
**INTERSECT: returns only records appearing in both tables**
~~None of the above are matched correctly~~

EXCEPTional

EXCEPT includes only the records in one table, but not in the other.
There are some monarchs that also act as the prime minister. One way to determine those monarchs in the monarchs table that do not also hold the title prime minister, is to use the EXCEPT clause.

  
SELECT monarch, country
FROM leaders.monarchs
EXCEPT
SELECT prime_minister, country
FROM leaders.prime_ministers;

This SQL query selects the monarch field from monarchs, then looks for common entries with the prime_ministers field, while also keeping track of the country for each leader.
Only the two European monarchs are not also prime ministers in the leaders database.
Only the records that appear in the left table, BUT DO NOT appear in the right table are included.

  
sql_stmt = "\
SELECT monarch, country \
FROM leaders.monarchs \
EXCEPT \
SELECT prime_minister, country \
FROM leaders.prime_ministers; \
"

pd.read_sql(sql_stmt, conn)

	monarch	country
0	Harald V	Norway
1	Felipe VI	Spain

Exercises

EXCEPT

Get the names of cities in cities which are not noted as capital cities in countries as a single field result.

Note that there are some countries in the world that are not included in the countries table, which will result in some cities not being labeled as capital cities when in fact they are.

Instructions

Order the resulting field in ascending order.
Can you spot the city/cities that are actually capital cities which this query misses?

  
-- Select field
SELECT name
  -- From cities
  FROM countries.cities
    -- Set theory clause
    EXCEPT
-- Select field
SELECT capital
  -- From countries
  FROM countries.countries
-- Order by result
ORDER BY name;

  
sql_stmt = "\
SELECT name \
  FROM countries.cities \
    EXCEPT \
SELECT capital \
  FROM countries.countries \
ORDER BY name; \
"

pd.read_sql(sql_stmt, conn).head()

	name
0	Abidjan
1	Ahmedabad
2	Alexandria
3	Almaty
4	Auckland

EXCEPT (2)

Now you will complete the previous query in reverse!

Determine the names of capital cities that are not listed in the cities table.

Instructions

Order by capital in ascending order.
The cities table contains information about 236 of the world’s most populous cities. The result of your query may surprise you in terms of the number of capital cities that DO NOT appear in this list!

  
-- Select field
SELECT capital
  -- From countries
  FROM countries.countries
    -- Set theory clause
    EXCEPT
-- Select field
SELECT name
  -- From cities
  FROM countries.cities
-- Order by ascending capital
ORDER BY capital;

  
sql_stmt = "\
SELECT capital \
  FROM countries.countries \
    EXCEPT \
SELECT name \
  FROM countries.cities \
ORDER BY capital; \
"

pd.read_sql(sql_stmt, conn).head()

	capital
0	Agana
1	Amman
2	Amsterdam
3	Andorra la Vella
4	Antananarivo

Semi-JOINs and Anti-JOINS

The previous six joins are all additive joins, in that they add columns to the original left table.
1. INNER JOIN
2. SELF JOIN
3. LEFT JOIN
4. RIGHT JOIN
5. FULL JOIN
6. CROSS JOIN
The last two joins use a right table to determine which records to keep in the left table.
Use these last to joins in a way similar to a WHERE clause dependent on the values of a second table.
semi-joins and anti-joins don’t have the same built-in SQL syntax that INNER JOIN and LEFT JOIN have.
semi-joins and anti-joins are useful tools in filtering table records on the records of another table.
The challenge will be to combine set theory clauses with semi-joins.
Determine the countries that gained independence before 1800.

  
SELECT name
FROM leaders.states
WHERE indep_year < 1800;

  
    country
0  Portugal
1     Spain

Determine president, country and continent

  
SELECT president, country, continent
FROM leaders.presidents

  
                 president   country      continent
   Abdel Fattah el-Sisi     Egypt         Africa
Marcelo Rebelo de Sousa  Portugal         Europe
          Jovenel Moise     Haiti  North America
            Jose Mujica   Uruguay  South America
  Ellen Johnson Sirleaf   Liberia         Africa
      Michelle Bachelet     Chile  South America
         Tran Dai Quang   Vietnam           Asia

`SEMI JOIN`

Determine the presidents of countries that gained independence before 1800.

  
SELECT president, country, continent
FROM leaders.presidents
WHERE country IN
    (SELECT name
     FROM leaders.states
     WHERE indep_year < 1800);

This is an example of a subquery, which is a query that sits inside another query.
Does this include the presidents of Spain and Portugal?
Since Spain does not have a president, it’s not included here and only the Portuguese president is listed.
The semi-join chooses records in the first table where a condition IS met in the second table.
The semi-join matches records by key field in the right table with those in the left.
It then picks out only the rows in the left table that match the condition.

  
sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE country IN \
    (SELECT name \
     FROM leaders.states \
     WHERE indep_year < 1800); \
"

pd.read_sql(sql_stmt, conn)

	president	country	continent
0	Marcelo Rebelo de Sousa	Portugal	Europe

`ANTI JOIN`

An anti-join chooses records in the first table where a condition IS NOT met in the second table.
Determine countries in the Americas founded after 1800.
Use NOT to exclude those countries in the subquery.

  
SELECT president, country, continent
FROM leaders.presidents
WHERE continent LIKE '%America'
    AND country NOT IN
        (SELECT name
         FROM leaders.states
         WHERE indep_year < 1800);

The anti-join picks out those columns in the left table that do not match the condition on the right table.

  
sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE continent LIKE '%%America' \
    AND country NOT IN \
        (SELECT name \
         FROM leaders.states \
         WHERE indep_year < 1800); \
"

pd.read_sql(sql_stmt, conn)

	president	country	continent
0	Jovenel Moise	Haiti	North America
1	Jose Mujica	Uruguay	South America
2	Michelle Bachelet	Chile	South America

Exercises

Semi-JOIN

You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.

Instructions 1/3

Flash back to our Intro to SQL for Data Science course and begin by selecting all country codes in the Middle East as a single field result using SELECT, FROM, and WHERE.

  
-- Select code
SELECT code
  -- From countries
  FROM countries
-- Where region is Middle East
WHERE region = 'Middle East';

  
sql_stmt = "\
SELECT code \
FROM countries.countries \
WHERE region = 'Middle East'; \
"

pd.read_sql(sql_stmt, conn)

	code
0	ARE
1	ARM
2	AZE
3	BHR
4	GEO
5	IRQ
6	ISR
7	YEM
8	JOR
9	KWT
10	CYP
11	LBN
12	OMN
13	QAT
14	SAU
15	SYR
16	TUR
17	PSE

You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.

Instructions 2/3

Comment out the answer to the previous tab by surrounding it in /* and */. You’ll come back to it!
Below the commented code, select only unique languages by name appearing in the languages table.
Order the resulting single field table by name in ascending order.

  
-- Select field
SELECT DISTINCT name
  -- From languages
  FROM languages
-- Order by name
ORDER BY name;

  
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
ORDER BY name; \
"

pd.read_sql(sql_stmt, conn)

	name
0	Afar
1	Afrikaans
2	Akyem
3	Albanian
4	Alsatian
...	...
391	Yapese
392	Yoruba
393	Yue
394	Zezuru
395	Zulu

396 rows × 1 columns

You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.

Instructions 3/3

Now combine the previous two queries into one query:

Add a WHERE IN statement to the ELECT DISTINCT query, and use the commented out query from the first instruction in there. That way, you can determine the unique languages spoken in the Middle East.

Carefully review this result and its code after completing it. It serves as a great example of subqueries, which are the focus of Chapter 4.

  
-- Select distinct fields
SELECT DISTINCT name
  -- From languages
  FROM languages
-- Where in statement
WHERE code IN
  -- Subquery
  (SELECT  code
   FROM countries
   WHERE region = 'Middle East')
-- Order by name
ORDER BY name;

  
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
WHERE code IN \
 (SELECT code \
  FROM countries.countries \
  WHERE region = 'Middle East') \
ORDER BY name; \
"

pd.read_sql(sql_stmt, conn)

	name
0	Arabic
1	Aramaic
2	Armenian
3	Azerbaijani
4	Azeri
5	Baluchi
6	Bulgarian
7	Circassian
8	English
9	Farsi
10	Filipino
11	French
12	Georgian
13	Greek
14	Hebrew
15	Hindi
16	Indian
17	Kurdish
18	Other
19	Persian
20	Romanian
21	Russian
22	Syriac
23	Turkish
24	Turkmen
25	unspecified
26	Urdu

Relating semi-JOIN to a tweaked INNER JOIN

Let’s revisit the code from the previous exercise, which retrieves languages spoken in the Middle East.

  
SELECT DISTINCT name
FROM languages
WHERE code IN
  (SELECT code
   FROM countries
   WHERE region = 'Middle East')
ORDER BY name;

Sometimes problems solved with semi-joins can also be solved using an inner join.

  
SELECT languages.name AS language
FROM languages
INNER JOIN countries
ON languages.code = countries.code
WHERE region = 'Middle East'
ORDER BY language;

This inner join isn’t quite right. What is missing from this second code block to get it to match with the correct answer produced by the first block?

Possible Answers

~~HAVING instead of WHERE~~
**DISTINCT**
~~UNIQUE~~

  
sql_stmt = "\
SELECT DISTINCT languages.name AS language \
FROM countries.languages \
INNER JOIN countries.countries \
ON languages.code = countries.code \
WHERE region = 'Middle East' \
ORDER BY language; \
"

pd.read_sql(sql_stmt, conn)

	language
0	Arabic
1	Aramaic
2	Armenian
3	Azerbaijani
4	Azeri
5	Baluchi
6	Bulgarian
7	Circassian
8	English
9	Farsi
10	Filipino
11	French
12	Georgian
13	Greek
14	Hebrew
15	Hindi
16	Indian
17	Kurdish
18	Other
19	Persian
20	Romanian
21	Russian
22	Syriac
23	Turkish
24	Turkmen
25	unspecified
26	Urdu

Diagnosing problems using anti-JOIN

Another powerful join in SQL is the anti-join. It is particularly useful in identifying which records are causing an incorrect number of records to appear in join queries.

You will also see another example of a subquery here, as you saw in the first exercise on semi-joins. Your goal is to identify the currencies used in Oceanian countries!

Instructions 1/3

Begin by determining the number of countries in countries that are listed in Oceania using SELECT, FROM, and WHERE.

  
-- Select statement
SELECT count(name)
  -- From countries
  FROM countries
-- Where continent is Oceania
WHERE continent = 'Oceania';

  
sql_stmt = "\
SELECT count(name) \
  FROM countries.countries \
WHERE continent = 'Oceania'; \
"

pd.read_sql(sql_stmt, conn)

	count
0	19

Instructions 2/3

Complete an inner join with countries AS c1 on the left and currencies AS c2 on the right to get the different currencies used in the countries of Oceania.
Match ON the code field in the two tables.
Include the country code, country name, and basic_unit AS currency.

Observe query result and make note of how many different countries are listed here.

  
-- 5. Select fields (with aliases)
SELECT c1.code, c1.name, c2.basic_unit as currency
  -- 1. From countries (alias as c1)
  FROM countries as c1
    -- 2. Join with currencies (alias as c2)
    INNER JOIN currencies as c2
    -- 3. Match on code
    ON c1.code = c2.code
-- 4. Where continent is Oceania
WHERE continent = 'Oceania';

  
sql_stmt = "\
SELECT c1.code, c1.name, c2.basic_unit as currency \
  FROM countries.countries as c1 \
    INNER JOIN countries.currencies as c2 \
        ON c1.code = c2.code \
WHERE continent = 'Oceania'; \
"

pd.read_sql(sql_stmt, conn)

	code	name	currency
0	AUS	Australia	Australian dollar
1	PYF	French Polynesia	CFP franc
2	KIR	Kiribati	Australian dollar
3	MHL	Marshall Islands	United States dollar
4	NRU	Nauru	Australian dollar
5	NCL	New Caledonia	CFP franc
6	NZL	New Zealand	New Zealand dollar
7	PLW	Palau	United States dollar
8	PNG	Papua New Guinea	Papua New Guinean kina
9	WSM	Samoa	Samoan tala
10	SLB	Solomon Islands	Solomon Islands dollar
11	TON	Tonga	Tongan paʻanga
12	TUV	Tuvalu	Australian dollar
13	TUV	Tuvalu	Tuvaluan dollar
14	VUT	Vanuatu	Vanuatu vatu

Instructions 3/3

Note that not all countries in Oceania were listed in the resulting inner join with currencies. Use an anti-join to determine which countries were not included!

Use NOT IN and (SELECT code FROM currencies) as a subquery to get the country code and country name for the Oceanian countries that are not included in the currencies table.

  
-- 3. Select fields
SELECT code, name
  -- 4. From Countries
  FROM countries
  -- 5. Where continent is Oceania
  WHERE continent = 'Oceania'
    -- 1. And code not in
    AND code NOT IN
    -- 2. Subquery
    (SELECT code
     FROM currencies);

  
sql_stmt = "\
SELECT code, name \
  FROM countries.countries \
  WHERE continent = 'Oceania' \
    AND code NOT IN \
    (SELECT code \
     FROM countries.currencies); \
"

pd.read_sql(sql_stmt, conn)

	code	name
0	ASM	American Samoa
1	FJI	Fiji Islands
2	GUM	Guam
3	FSM	Micronesia, Federated States of
4	MNP	Northern Mariana Islands

Set theory challenge

Congratulations! You’ve now made your way to the challenge problem for this third chapter. Your task here will be to incorporate two of UNION/UNION ALL/INTERSECT/EXCEPT to solve a challenge involving three tables.

In addition, you will use a subquery as you have in the last two exercises! This will be great practice as you hop into subqueries more in Chapter 4!

Instructions

Identify the country codes that are included in either economies or currencies but not in populations.
Use that result to determine the names of cities in the countries that match the specification in the previous instruction.

  
-- Select the city name
SELECT country_code, name
  -- Alias the table where city name resides
  FROM cities AS c1
  -- Choose only records matching the result of multiple set theory clauses
  WHERE country_code IN
(
    -- Select appropriate field from economies AS e
    SELECT e.code
    FROM economies AS e
    -- Get all additional (unique) values of the field from currencies AS c2  
    UNION ALL
    SELECT c2.code
    FROM currencies AS c2
    -- Exclude those appearing in populations AS p
    EXCEPT
    SELECT p.country_code
    FROM populations AS p
);

  
sql_stmt = "\
SELECT country_code, name \
    FROM countries.cities AS c1 \
    WHERE country_code IN \
(   SELECT e.code \
    FROM countries.economies AS e \
    UNION ALL \
    SELECT c2.code \
    FROM countries.currencies AS c2 \
    EXCEPT \
    SELECT p.country_code \
    FROM countries.populations AS p); \
"

pd.read_sql(sql_stmt, conn)

	country_code	name
0	ROM	Bucharest
1	TWN	Kaohsiung
2	TWN	New Taipei City
3	TWN	Taichung
4	TWN	Tainan
5	TWN	Taipei

Subqueries

In this closing chapter, you’ll learn how to use nested queries to add some finesse to your data insights. You’ll also wrap all of the content covered throughout this course into solving three challenge problems.

Subqueries inside WHERE and SELECT clauses

The most common type of subquery is one inside of a WHERE statement.
Examples include semi-join and anti-join

leaders.states

  
SELECT name, indep_year, fert_rate, women_parli_perc
FROM leaders.states;

Average fert_rate

  
SELECT AVG(fert_rate)
FROM leaders.states;

Asian countries below average fert_rate

  
SELECT name, fert_rate
FROM leaders.states
WHERE continent = 'Asia'
AND fert_rate < 
(SELECT AVG(fert_rate)
 FROM leaders.states);

  
sql_stmt = "\
SELECT name, fert_rate \
FROM leaders.states \
WHERE continent = 'Asia' \
AND fert_rate <  \
(SELECT AVG(fert_rate) \
 FROM leaders.states); \
"

pd.read_sql(sql_stmt, conn)

	name	fert_rate
0	Brunei	1.96
1	Vietnam	1.70

The second most common type of subquery is inside of a SELECT clause.

Count the number of countries listed in states table for each continent in the prime_ministers table.

Continents in the prime_ministers table

  
SELECT DISTINCT continent
FROM prime_ministers;

Determine the counts of the number of countries in states for each of the continents in the last slide

  
SELECT count(name)
FROM leaders.states
WHERE continent IN
(SELECT DISTINCT continent
 FROM leaders.prime_ministers);
--Returns total count of countries

  
SELECT DISTINCT continent,
(SELECT COUNT(*)
 FROM leaders.states
 WHERE prime_ministers.continent = states.continent) as countries_num
FROM leaders.prime_ministers

The subquery involving states, can also reference the prime_ministers table in the main query.
Anytime you do a subquery inside a SELECT statement, you need to give the subquery an alias (e.g. countries_num in the example)
There are numerous ways to solve problems with SQL queries.
A carefully constructed JOIN could achieve this same result.

  
sql_stmt = "\
SELECT DISTINCT continent, \
(SELECT COUNT(*) \
 FROM leaders.states \
 WHERE prime_ministers.continent = states.continent) as countries_num \
FROM leaders.prime_ministers \
"

pd.read_sql(sql_stmt, conn)

	continent	countries_num
0	Africa	2
1	Asia	4
2	Europe	3
3	North America	1
4	Oceania	1

Exercises

Subquery inside WHERE

You’ll now try to figure out which countries had high average life expectancies (at the country level) in 2015.

Instructions 1/2

Begin by calculating the average life expectancy across all countries for 2015.

  
-- Select average life_expectancy
SELECT avg(life_expectancy)
  -- From populations
  FROM populations
-- Where year is 2015
WHERE year = 2015;

  
sql_stmt = "\
SELECT avg(life_expectancy) \
  FROM countries.populations \
WHERE year = 2015; \
"

pd.read_sql(sql_stmt, conn)

	avg
0	71.676342

Instructions 2/2

Recall that you can use SQL to do calculations for you. Suppose we wanted only records that were above 1.15 * 100 in terms of life expectancy for 2015:

  
SELECT *
  FROM populations
WHERE life_expectancy > 1.15 * 100
  AND year = 2015;

Select all fields from populations with records corresponding to larger than 1.15 times the average you calculated in the first task for 2015. In other words, change the 100 in the example above with a subquery.

  
-- Select fields
SELECT *
  -- From populations
  FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
  -- 1.15 * subquery
  1.15 *
   (SELECT avg(life_expectancy)
   FROM populations
   WHERE year = 2015)
  AND year = 2015;

  
sql_stmt = "\
SELECT * \
  FROM countries.populations \
WHERE life_expectancy > \
  1.15 * \
   (SELECT avg(life_expectancy) \
   FROM countries.populations \
   WHERE year = 2015) \
  AND year = 2015; \
"

pd.read_sql(sql_stmt, conn)

	pop_id	country_code	year	fertility_rate	life_expectancy	size
0	21	AUS	2015	1.833	82.451220	23789752.0
1	376	CHE	2015	1.540	83.197560	8281430.0
2	356	ESP	2015	1.320	83.380486	46443992.0
3	134	FRA	2015	2.010	82.670730	66538392.0
4	170	HKG	2015	1.195	84.278046	7305700.0
5	174	ISL	2015	1.930	82.860980	330815.0
6	190	ITA	2015	1.370	83.490240	60730584.0
7	194	JPN	2015	1.460	83.843660	126958470.0
8	340	SGP	2015	1.240	82.595120	5535002.0
9	374	SWE	2015	1.880	82.551216	9799186.0

Subquery inside WHERE (2)

Use your knowledge of subqueries in WHERE to get the urban area population for only capital cities.

Instructions

Make use of the capital field in the countries table in your subquery.
Select the city name, country code, and urban area population fields.

  
-- 2. Select fields
SELECT name, country_code, urbanarea_pop
  -- 3. From cities
  FROM cities
-- 4. Where city name in the field of capital cities
WHERE name IN
  -- 1. Subquery
  (SELECT capital
   FROM countries)
ORDER BY urbanarea_pop DESC;

  
sql_stmt = "\
SELECT name, country_code, urbanarea_pop \
  FROM countries.cities \
WHERE name IN \
  (SELECT capital \
   FROM countries.countries) \
ORDER BY urbanarea_pop DESC; \
"

pd.read_sql(sql_stmt, conn)

	name	country_code	urbanarea_pop
0	Beijing	CHN	21516000.0
1	Dhaka	BGD	14543124.0
2	Tokyo	JPN	13513734.0
3	Moscow	RUS	12197596.0
4	Cairo	EGY	10230350.0
5	Kinshasa	COD	10130000.0
6	Jakarta	IDN	10075310.0
7	Seoul	KOR	9995784.0
8	Mexico City	MEX	8974724.0
9	Lima	PER	8852000.0
10	London	GBR	8673713.0
11	Bangkok	THA	8280925.0
12	Tehran	IRN	8154051.0
13	Bogota	COL	7878783.0
14	Baghdad	IRQ	7180889.0
15	Hanoi	VNM	6844100.0
16	Santiago	CHL	5743719.0
17	Riyadh	SAU	5676621.0
18	Singapore	SGP	5535000.0
19	Ankara	TUR	5271000.0
20	Khartoum	SDN	3639598.0
21	Berlin	DEU	3517424.0
22	Algiers	DZA	3415811.0
23	Kabul	AFG	3414100.0
24	Pyongyang	PRK	3255388.0
25	Madrid	ESP	3207247.0
26	Baku	AZE	3202300.0
27	Nairobi	KEN	3138369.0
28	Addis Ababa	ETH	3103673.0
29	Buenos Aires	ARG	3054300.0
30	Kiev	UKR	2908703.0
31	Rome	ITA	2877215.0
32	Luanda	AGO	2825311.0
33	Quito	ECU	2671191.0
34	Managua	NIC	2560789.0
35	Brasilia	BRA	2556149.0
36	Yaounde	CMR	2440462.0
37	Tashkent	UZB	2309600.0
38	Phnom Penh	KHM	2234566.0
39	Paris	FRA	2229621.0
40	Ouagadougou	BFA	2200000.0
41	Guatemala City	GTM	2110100.0
42	Havana	CUB	2106146.0
43	Accra	GHA	2070463.0
44	Minsk	BLR	1959781.0
45	Caracas	VEN	1943901.0
46	Sana'a	YEM	1937451.0
47	Islamabad	PAK	1900000.0
48	Vienna	AUT	1863881.0
49	Brazzaville	COG	1827000.0
50	Manila	PHL	1780148.0
51	Kuala Lumpur	MYS	1768000.0
52	Maputo	MOZ	1766184.0
53	Budapest	HUN	1759407.0
54	Warsaw	POL	1753977.0
55	Lusaka	ZMB	1742979.0
56	Harare	ZWE	1606000.0
57	Kampala	UGA	1507080.0
58	Prague	CZE	1324000.0
59	Montevideo	URY	1305082.0
60	Abuja	NGA	1235880.0
61	Dakar	SEN	1146053.0
62	Abu Dhabi	ARE	1145000.0
63	Tripoli	LBY	1126000.0
64	Yerevan	ARM	1060138.0
65	Tunis	TUN	1056247.0

Subquery inside SELECT

In this exercise, you’ll see how some queries can be written using either a join or a subquery.

You have seen previously how to use GROUP BY with aggregate functions and an inner join to get summarized information from multiple tables.

The code given in query.sql selects the top nine countries in terms of number of cities appearing in the cities table. Recall that this corresponds to the most populous cities in the world. Your task will be to convert the commented out code to get the same result as the code shown.

Instructions 1/2

Just Submit Answer here!

  
SELECT countries.name AS country, COUNT(*) AS cities_num
  FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;

  
sql_stmt = "\
SELECT countries.name AS country, COUNT(*) AS cities_num \
  FROM countries.cities \
    INNER JOIN countries.countries \
    ON countries.code = cities.country_code \
GROUP BY country \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"

pd.read_sql(sql_stmt, conn)

	country	cities_num
0	China	36
1	India	18
2	Japan	11
3	Brazil	10
4	Pakistan	9
5	United States	9
6	Indonesia	7
7	Russian Federation	7
8	South Korea	7

Instructions 2/2

Remove the comments around the second query and comment out the first query instead.
Convert the GROUP BY code to use a subquery inside of SELECT, i.e. fill in the blanks to get a result that matches the one given using the GROUP BY code in the first query.
Again, sort the result by cities_num descending and then by country ascending.

  
SELECT countries.name AS country,
  (SELECT count(*)
   FROM cities
   WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;

  
sql_stmt = "\
SELECT countries.name AS country, \
  (SELECT count(*) \
   FROM countries.cities \
   WHERE countries.code = cities.country_code) AS cities_num \
FROM countries.countries \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"

pd.read_sql(sql_stmt, conn)

	country	cities_num
0	China	36
1	India	18
2	Japan	11
3	Brazil	10
4	Pakistan	9
5	United States	9
6	Indonesia	7
7	Russian Federation	7
8	South Korea	7

Subquery inside FROM clause

The last basic type of subquery exists inside of a FROM clause. Determine the maximum percentage of women in parliament for each continent listing in leaders.states

  
SELECT continent, MAX(women_parli_perc) AS max_perc
FROM states
GROUP BY continent
ORDER BY continent;

This query will only work if continent is included as one of th fields in the SELECT clause, since we are grouping based on that field.

  
sql_stmt = "\
SELECT continent, MAX(women_parli_perc) AS max_perc \
FROM leaders.states \
GROUP BY continent \
ORDER BY continent; \
"

pd.read_sql(sql_stmt, conn)

	continent	max_perc
0	Africa	14.90
1	Asia	24.00
2	Europe	39.60
3	North America	2.74
4	Oceania	32.74
5	South America	22.31

Focusing on records in monarchs

Multiple tables can be included in the FROM clause, by adding a comma between them

  
SELECT monarchs.continent
FROM monarchs, states
WHERE monarchs.continent = states.continent
ORDER BY continent;

Produces part of the answer; how should duplicates be removed?

  
sql_stmt = "\
SELECT monarchs.continent \
FROM leaders.monarchs, leaders.states \
WHERE monarchs.continent = states.continent \
ORDER BY continent; \
"

pd.read_sql(sql_stmt, conn)

	continent
0	Asia
1	Asia
2	Asia
3	Asia
4	Asia
5	Asia
6	Asia
7	Asia
8	Europe
9	Europe
10	Europe
11	Europe
12	Europe
13	Europe

Finishing the subquery

To get Asia and Europe to appear only once, use DISTINCT in the SELECT statement.

  
SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
    (SELECT continent, MAX(women_parli_perc) AS max_perc
     FROM states
     GROUP BY continent) AS subquery
WHERE monarchs.continent = subquery.continent
ORDER BY continent;

How is the max_perc column included with continent?
Instead of including states in the FROM clause, include the subquery instead and alias it with a name like subquery.
This is how to include a subquery as a temporary table in the FROM clause.

  
sql_stmt = "\
SELECT DISTINCT monarchs.continent, subquery.max_perc \
FROM leaders.monarchs, \
    (SELECT continent, MAX(women_parli_perc) AS max_perc \
     FROM leaders.states \
     GROUP BY continent) AS subquery \
WHERE monarchs.continent = subquery.continent \
ORDER BY continent; \
"

pd.read_sql(sql_stmt, conn)

	continent	max_perc
0	Asia	24.0
1	Europe	39.6

Exercises

Subquery inside FROM

The last type of subquery you will work with is one inside of FROM.

You will use this to determine the number of languages spoken for each country, identified by the country’s local name! (Note this may be different than the name field and is stored in the local_name field.)

Instructions 1/2

Begin by determining for each country code how many languages are listed in the languages table using SELECT, FROM, and GROUP BY. Alias the aggregated field as lang_num.

  
-- Select fields (with aliases)
SELECT code, count(name) AS lang_num
  -- From languages
  FROM languages
-- Group by code
GROUP BY code;

  
sql_stmt = "\
SELECT code, count(name) AS lang_num \
FROM countries.languages \
GROUP BY code \
ORDER BY lang_num DESC; \
"

lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())

  code  lang_num
ZMB        19
ZWE        16
ETH        16
IND        14
NPL        14
    code  lang_num
COL         1
AIA         1
DOM         1
SAU         1
PRK         1

Instructions 2/2

Include the previous query (aliased as subquery) as a subquery in the FROM clause of a new query.
Select the local name of the country from countries.
Also, select lang_num from subquery.
Make sure to use WHERE appropriately to match code in countries and in subquery.
Sort by lang_num in descending order.

  
-- Select fields
SELECT local_name, subquery.lang_num
  -- From countries
  FROM countries,
    -- Subquery (alias as subquery)
    (SELECT code, count(name) as lang_num
     FROM languages
     GROUP BY code) AS subquery
  -- Where codes match
  WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;

  
sql_stmt = "\
SELECT local_name, subquery.lang_num \
  FROM countries.countries, \
    (SELECT code, count(name) as lang_num \
     FROM countries.languages \
     GROUP BY code) AS subquery \
  WHERE countries.code = subquery.code \
ORDER BY lang_num DESC; \
"

lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())

     local_name  lang_num
      Zambia        19
 YeItyop´iya        16
    Zimbabwe        16
Bharat/India        14
       Nepal        14
                       local_name  lang_num
        Republica Dominicana         1
The Turks and Caicos Islands         1
                   Nederland         1
              United Kingdom         1
                      Brasil         1

Advanced subquery

You can also nest multiple subqueries to answer even more specific questions.

In this exercise, for each of the six continents listed in 2015, you’ll identify which country had the maximum inflation rate (and how high it was) using multiple subqueries. The table result of your query in Task 3 should look something like the following, where anything between < > will be filled in with appropriate values:

+------------+---------------+-------------------+
| name       | continent     | inflation_rate    |
|------------+---------------+-------------------|
| <country1> | North America | <max_inflation1>  |
| <country2> | Africa        | <max_inflation2>  |
| <country3> | Oceania       | <max_inflation3>  |
| <country4> | Europe        | <max_inflation4>  |
| <country5> | South America | <max_inflation5>  |
| <country6> | Asia          | <max_inflation6>  |
+------------+---------------+-------------------+

Again, there are multiple ways to get to this solution using only joins, but the focus here is on showing you an introduction into advanced subqueries.

Instructions 1/3

Create an inner join with countries on the left and economies on the right with USING. Do not alias your tables or columns.
Retrieve the country name, continent, and inflation rate for 2015.

  
-- Select fields
SELECT name, continent, inflation_rate
  -- From countries
  FROM countries
    -- Join to economies
    INNER JOIN economies
    -- Match on code
    USING (code)
-- Where year is 2015
WHERE year = 2015;

  
sql_stmt = "\
SELECT name, continent, inflation_rate \
  FROM countries.countries \
    INNER JOIN countries.economies \
    USING (code) \
WHERE year = 2015; \
"

inf_rate = pd.read_sql(sql_stmt, conn)
print(inf_rate.head())
print(inf_rate.tail())

                   name      continent  inflation_rate
         Afghanistan           Asia          -1.549
              Angola         Africa          10.287
             Albania         Europe           1.896
United Arab Emirates           Asia           4.070
           Argentina  South America             NaN
             name continent  inflation_rate
       Samoa   Oceania           1.923
       Yemen      Asia          39.403
South Africa    Africa           4.575
      Zambia    Africa          10.107
    Zimbabwe    Africa          -2.410

Instructions 2/3

Determine the maximum inflation rate for each continent in 2015 using the previous query as a subquery called subquery in the FROM clause.
Select the maximum inflation rate AS max_inf grouped by continent.
This will result in the six maximum inflation rates in 2015 for the six continents as one field table. (Don’t include continent in the outer SELECT statement.)

  
-- Select fields
SELECT max(inflation_rate) as max_inf
  -- Subquery using FROM (alias as subquery)
  FROM (
      SELECT name, continent, inflation_rate
      FROM countries
      INNER JOIN economies
      USING (code)
      WHERE year = 2015) AS subquery
-- Group by continent
GROUP BY continent;

  
sql_stmt = "\
SELECT max(inflation_rate) as max_inf \
  FROM ( \
      SELECT name, continent, inflation_rate \
      FROM countries.countries \
      INNER JOIN countries.economies \
      USING (code) \
      WHERE year = 2015) AS subquery \
GROUP BY continent; \
"

pd.read_sql(sql_stmt, conn)

	max_inf
0	21.858
1	39.403
2	121.738
3	7.524
4	48.684
5	9.784

Instructions 3/3

Append the second part’s query to the first part’s query using WHERE, AND, and IN to obtain the name of the country, its continent, and the maximum inflation rate for each continent in 2015. Revisit the sample output in the assignment text at the beginning of the exercise to see how this matches up.
For the sake of practice, change all joining conditions to use ON instead of USING.
This code works since each of the six maximum inflation rate values occur only once in the 2015 data. Think about whether this particular code involving subqueries would work in cases where there are ties for the maximum inflation rate values.

  
-- Select fields
SELECT name, continent, inflation_rate
  -- From countries
  FROM countries
    -- Join to economies
    INNER JOIN economies
    -- Match on code
    ON countries.code = economies.code
  -- Where year is 2015
  WHERE year = 2015
    -- And inflation rate in subquery (alias as subquery)
    AND inflation_rate IN (
         SELECT max(inflation_rate) as max_inf
         FROM (
               SELECT name, continent, inflation_rate
               FROM countries
               INNER JOIN economies
               ON countries.code = economies.code
               WHERE year = 2015) AS subquery
        GROUP BY continent);

  
sql_stmt = "\
SELECT name, continent, inflation_rate \
  FROM countries.countries \
    INNER JOIN countries.economies \
    ON countries.code = economies.code \
  WHERE year = 2015 \
    AND inflation_rate IN ( \
         SELECT max(inflation_rate) as max_inf \
         FROM ( \
               SELECT name, continent, inflation_rate \
               FROM countries.countries \
               INNER JOIN countries.economies \
               ON countries.code = economies.code \
               WHERE year = 2015) AS subquery \
        GROUP BY continent); \
"

pd.read_sql(sql_stmt, conn)

	name	continent	inflation_rate
0	Haiti	North America	7.524
1	Malawi	Africa	21.858
2	Nauru	Oceania	9.784
3	Ukraine	Europe	48.684
4	Venezuela	South America	121.738
5	Yemen	Asia	39.403

Subquery challenge

Let’s test your understanding of the subqueries with a challenge problem! Use a subquery to get 2015 economic data for countries that do not have

gov_form of 'Constitutional Monarchy' or
'Republic' in their gov_form.

Here, gov_form stands for the form of the government for each country. Review the different entries for gov_form in the countries table.

Instructions

Select the country code, inflation rate, and unemployment rate.
Order by inflation rate ascending.
Do not use table aliasing in this exercise.

  
-- Select fields
SELECT code, inflation_rate, unemployment_rate
  -- From economies
  FROM economies
  -- Where year is 2015 and code is not in
  WHERE year = 2015 AND code NOT IN
    -- Subquery
    (SELECT code
     FROM countries
     WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic'))
-- Order by inflation rate
ORDER BY inflation_rate;

  
sql_stmt = "\
SELECT code, inflation_rate, unemployment_rate \
  FROM countries.economies \
  WHERE year = 2015 AND code NOT IN \
    (SELECT code \
     FROM countries.countries \
     WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%%Republic')) \
ORDER BY inflation_rate; \
"

pd.read_sql(sql_stmt, conn)

	code	inflation_rate	unemployment_rate
0	AFG	-1.549	NaN
1	CHE	-1.140	3.178
2	PRI	-0.751	12.000
3	ROU	-0.596	6.812
4	BRN	-0.423	6.900
5	TON	-0.283	NaN
6	OMN	0.065	NaN
7	TLS	0.553	NaN
8	BEL	0.620	8.492
9	CAN	1.132	6.900
10	MNE	1.204	NaN
11	SRB	1.392	18.200
12	AUS	1.461	6.058
13	QAT	1.814	NaN
14	BHR	1.836	NaN
15	WSM	1.923	NaN
16	MYS	2.104	3.100
17	SAU	2.189	5.591
18	HKG	3.037	3.296
19	KWT	3.233	2.072
20	ARE	4.070	NaN
21	MAC	4.564	1.825
22	SWZ	4.960	NaN
23	BTN	6.336	3.200
24	LBY	9.839	NaN
25	SSD	52.813	NaN

Subquery review

Within which SQL clause are subqueries most frequently found?

Answer the question

**WHERE**
~~FROM~~
~~SELECT~~
IN

Course Review

In SQL, a join combines columns from one or more tables in a relational database via a lookup process.
There are four types of joins covered in this course

Types of joins:

INNER JOIN: also denoted as JOIN
Self-joins: special case
OUTER JOIN
LEFT JOIN: also denoted as LEFT OUTER JOIN
RIGHT JOIN: also denoted as RIGHT OUTER JOIN
FULL JOIN: also denoted as FULL OUTER JOIN
CROSS JOIN: create all possible combinations between two tables
Semi-join / Anti-join

Notes

Words appearing in ALL capital letters correspond to joins having simple SQL syntax.
Self-joins, semi-joins, and anti-joins don’t have built-in SQL syntax.
An INNER JOIN keeps only the records in which the key field (or fields) is in both tables.
A LEFT JOIN keeps all the records in fields specified in the left table and includes the matches in the right table based on the key field or fields.
Key field values that don’t match in the right table are included as missing data in the resulting table of a LEFT JOIN.
A RIGHT JOIN keeps all the records specified in the right table and includes the matches from the key field(s) in the left table.
THose that don’t match are included as missing values in the resulting table from the RIGHT JOIN query.
A FULL JOIN is a combination of a LEFT JOIN and a RIGHT JOIN showing exactly which values appear in both tables and those that appear in only one or the other table.
A CROSS JOIN matches all records from fields specified in one table with all records from fields specified in another table.
Remember that a CROSS JOIN does not have an ON or USING clause, but otherwise looks very similar to the code for an INNER JOIN, LEFT JOIN, RIGHT JOIN, or FULL JOIN.

Set Theory Clauses

Recall that UNION includes every record in both tables but DOES NOT double count those that are in both tables.
UNION ALL does replicate those that are in both tables.
INTERSECT gives only those records found in both of the two tables.
EXCEPT gives only those records in one table but not the other.

Semi-joins and Anti-joins

When you’d like to filter your first table based on conditions set on a second table, you should use a semi-join to accomplish the task.
If instead you’d like to filter the first table based on conditions NOT being met on a second table, you should use an anti-join.
Anti-joins are particularly useful in diagnosing problems with other joins in terms of getting fewer or more records than expected.

Types of basic subqueries

The most common type of subquery is done inside of a WHERE clause.
The next most frequent types of subqueries are inside SELECT clauses and inside FROM clauses.
Subqueries can also find their way into the ON statement of a join in ways similar to what you’ve seen inside WHERE clauses too.

Final Challenge

Welcome to the end of the course! The next three exercises will test your knowledge of the content covered in this course and apply many of the ideas you’ve seen to difficult problems. Good luck!

Read carefully over the instructions and solve them step-by-step, thinking about how the different clauses work together.

In this exercise, you’ll need to get the country names and other 2015 data in the economies table and the countries table for Central American countries with an official language.

Instructions

Select unique country names. Also select the total investment and imports fields.
Use a left join with countries on the left. (An inner join would also work, but please use a left join here.)
Match on code in the two tables AND use a subquery inside of ON to choose the appropriate languages records.
Order by country name ascending.
Use table aliasing but not field aliasing in this exercise.

  
-- Select fields
SELECT DISTINCT c.name, e.total_investment, e.imports
  -- From table (with alias)
  FROM countries AS c
    -- Join with table (with alias)
    LEFT JOIN economies AS e
      -- Match on code
      ON (c.code = e.code
      -- and code in Subquery
        AND c.code IN (
          SELECT l.code
          FROM languages AS l
          WHERE official = 'true'
        ) )
  -- Where region and year are correct
  WHERE year = 2015 AND region = 'Central America'
-- Order by field
ORDER BY name;

  
sql_stmt = "\
SELECT DISTINCT c.name, e.total_investment, e.imports \
  FROM countries.countries AS c \
    LEFT JOIN countries.economies AS e \
      ON (c.code = e.code \
        AND c.code IN ( \
          SELECT l.code \
          FROM countries.languages AS l \
          WHERE official = 'true' \
        ) ) \
  WHERE year = 2015 AND region = 'Central America' \
ORDER BY name; \
"

pd.read_sql(sql_stmt, conn)

	name	total_investment	imports
0	Belize	22.014	6.743
1	Costa Rica	20.218	4.629
2	El Salvador	13.983	8.193
3	Guatemala	13.433	15.124
4	Honduras	24.633	9.353
5	Nicaragua	31.862	11.665
6	Panama	46.557	5.898

Final Challenge (2)

Whoofta! That was challenging, huh?

Let’s ease up a bit and calculate the average fertility rate for each region in 2015.

Instructions

Include the name of region, its continent, and average fertility rate aliased as avg_fert_rate.
Sort based on avg_fert_rate ascending.
Remember that you’ll need to GROUP BY all fields that aren’t included in the aggregate function of SELECT.

  
-- Select fields
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate
  -- From left table
  FROM populations AS p
    -- Join to right table
    INNER JOIN countries AS c
      -- Match on join condition
      ON p.country_code = c.code
  -- Where specific records matching some condition
  WHERE year = 2015
-- Group appropriately
GROUP BY c.continent, c.region
-- Order appropriately
ORDER BY avg_fert_rate;

  
sql_stmt = "\
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate \
  FROM countries.populations AS p \
    INNER JOIN countries.countries AS c \
      ON p.country_code = c.code \
  WHERE year = 2015 \
GROUP BY c.continent, c.region \
ORDER BY avg_fert_rate; \
"

pd.read_sql(sql_stmt, conn)

	region	continent	avg_fert_rate
0	Southern Europe	Europe	1.426100
1	Eastern Europe	Europe	1.490889
2	Baltic Countries	Europe	1.603333
3	Eastern Asia	Asia	1.620714
4	Western Europe	Europe	1.632500
5	North America	North America	1.765750
6	British Islands	Europe	1.875000
7	Nordic Countries	Europe	1.893333
8	Australia and New Zealand	Oceania	1.911500
9	Caribbean	North America	1.950571
10	Southeast Asia	Asia	2.156000
11	South America	South America	2.274750
12	Central America	North America	2.326375
13	Middle East	Asia	2.547056
14	Southern and Central Asia	Asia	2.634143
15	Micronesia	Oceania	2.864750
16	Northern Africa	Africa	2.908167
17	Southern Africa	Africa	2.990800
18	Melanesia	Oceania	3.135800
19	Polynesia	Oceania	3.244333
20	Eastern Africa	Africa	4.386706
21	Western Africa	Africa	4.960125
22	Central Africa	Africa	4.967889

Final Challenge (3)

Welcome to the last challenge problem. By now you’re a query warrior! Remember that these challenges are designed to take you to the limit to solidify your SQL knowledge! Take a deep breath and solve this step-by-step.

You are now tasked with determining the top 10 capital cities in Europe and the Americas in terms of a calculated percentage using city_proper_pop and metroarea_pop in cities.

Do not use table aliasing in this exercise.

Instructions

Select the city name, country code, city proper population, and metro area population.
Calculate the percentage of metro area population composed of city proper population for each city in cities, aliased as city_perc.
Focus only on capital cities in Europe and the Americas in a subquery.
Make sure to exclude records with missing data on metro area population.
Order the result by city_perc descending.
Then determine the top 10 capital cities in Europe and the Americas in terms of this city_perc percentage.

  
-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,  
      -- Calculate city_perc
      city_proper_pop / metroarea_pop * 100 AS city_perc
  -- From appropriate table
  FROM cities
  -- Where 
  WHERE name IN
    -- Subquery
    (SELECT capital
     FROM countries
     WHERE (continent = 'Europe'
        OR continent LIKE '%America'))
       AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc desc
-- Limit amount
LIMIT 10;

  
sql_stmt = "\
SELECT name, country_code, city_proper_pop, metroarea_pop, \
      city_proper_pop / metroarea_pop * 100 AS city_perc \
  FROM countries.cities \
  WHERE name IN \
    (SELECT capital \
     FROM countries.countries \
     WHERE (continent = 'Europe' \
        OR continent LIKE '%%America')) \
       AND metroarea_pop IS NOT NULL \
ORDER BY city_perc desc \
LIMIT 10; \
"

pd.read_sql(sql_stmt, conn)

	name	country_code	city_proper_pop	metroarea_pop	city_perc
0	Lima	PER	8852000.0	10750000.0	82.344186
1	Bogota	COL	7878783.0	9800000.0	80.395746
2	Moscow	RUS	12197596.0	16170000.0	75.433493
3	Vienna	AUT	1863881.0	2600000.0	71.687728
4	Montevideo	URY	1305082.0	1947604.0	67.009616
5	Caracas	VEN	1943901.0	2923959.0	66.481817
6	Rome	ITA	2877215.0	4353775.0	66.085523
7	Brasilia	BRA	2556149.0	3919864.0	65.210146
8	London	GBR	8673713.0	13879757.0	62.491822
9	Budapest	HUN	1759407.0	2927944.0	60.090184

  
conn.close()

Certificate

Python, PostgreSQL, Data Joining

This post is licensed under CC BY 4.0 by the author.

	code
0	ARE
1	ARM
2	AZE
3	BHR
4	GEO
5	IRQ
6	ISR
7	YEM
8	JOR
9	KWT
10	CYP
11	LBN
12	OMN
13	QAT
14	SAU
15	SYR
16	TUR
17	PSE

	code
0	ARE
1	ARM
2	AZE
3	BHR
4	GEO
5	IRQ
6	ISR
7	YEM
8	JOR
9	KWT
10	CYP
11	LBN
12	OMN
13	QAT
14	SAU
15	SYR
16	TUR
17	PSE

	code
0	ARE
1	ARM
2	AZE
3	BHR
4	GEO
5	IRQ
6	ISR
7	YEM
8	JOR
9	KWT
10	CYP
11	LBN
12	OMN
13	QAT
14	SAU
15	SYR
16	TUR
17	PSE