Joining Data in SQL
- Course: DataCamp: Joining Data in SQL
- This notebook was created as a reproducible reference.
- The material is from the course
- I completed the exercises
- I installed and configured a PostgreSQL database to run all the queries locally.
- If you find the content beneficial, consider a DataCamp Subscription.
Course Description
Now that you’ve learned the basics of SQL in our Intro to SQL for Data Science course, it’s time to supercharge your queries using joins and relational set theory! In this course you’ll learn all about the power of joining tables while exploring interesting features of countries and their cities throughout the world. You will master inner and outer joins, as well as self-joins, semi-joins, anti-joins and cross joins - fundamental tools in any PostgreSQL wizard’s toolbox. You’ll fear set theory no more, after learning all about unions, intersections, and except clauses through easy-to-understand diagrams and examples. Lastly, you’ll be introduced to the challenging topic of subqueries. You will see a visual perspective to grasp the ideas throughout the course using the mediums of Venn diagrams and other linking illustrations.
Datasets
Imports
1
2
3
4
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, inspect, text
import pandas as pd
from pprint import pprint as pp
Pandas Configuration Options
1
2
3
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
PostgreSQL Connection
- In order to run this Notebook, install, setup and configure a PostgreSQL database with the previously mentioned datasets.
- Edit
engine
to use your database username and password.
1
engine = create_engine('postgresql://postgres:postgres@localhost/postgres')
Specify a default schema that should be used for tables defined within this MetaData
instance. This means that when you define a new Table with this MetaData
instance, SQLAlchemy knows to associate that table with the countries
schema. However, this association affects only those operations and objects directly utilizing the MetaData
instance for defining tables or reflecting database schema.
1
meta = MetaData(schema="countries")
1
conn = engine.connect()
1
# conn.close()
Example(s) without pd.DataFrames - use fetchall
For raw SQL queries executed through a connection (conn.execute
), the MetaData
’s schema setting doesn’t automatically apply. The SQL execution context is lower-level and doesn’t use the MetaData
schema setting for its operations. Therefore, when executing raw SQL statements, you must explicitly set the schema in your queries, as mentioned previously, or adjust the search path for the session.
Data Model Overview
Tables and Primary Keys
- Countries
- Primary key:
code
- Fields:
name
,continent
,region
,surface_area
,indep_year
,local_name
,gov_form
,capital
,cap_long
,cap_lat
- Relations: Potentially linked to other tables through the
code
field.
- Primary key:
- Cities
- Primary key:
name
- Fields:
country_code
,city_proper_pop
,metroarea_pop
,urbanarea_pop
- Relations: Linked to
Countries
throughcountry_code
.
- Primary key:
- Languages
- Primary key:
lang_id
- Fields:
code
,name
,percent
,official
- Relations: Possibly linked to
Countries
throughcode
.
- Primary key:
- Economies (general, 2010, 2015 versions)
- Primary key:
code
for each version - Fields:
year
,income_group
,gdp_percapita
(and other economic indicators depending on the table) - Relations: Linked to
Countries
throughcode
.
- Primary key:
- Populations
- Primary key:
pop_id
- Fields:
country_code
,year
,fertility_rate
,life_expectancy
,size
- Relations: Linked to
Countries
throughcountry_code
.
- Primary key:
- Currencies
- Primary key:
curr_id
- Fields:
code
,basic_unit
,curr_code
,frac_unit
,frac_perbasic
- Relations: Potentially linked to
Countries
throughcode
.
- Primary key:
- Countries Plus
- Primary key:
code
- Fields:
name
,continent
,surface_area
,geosize_group
- Relations: An alternative or additional representation of
Countries
.
- Primary key:
Data Integrity and Relationships
Even though the SQL file lacks explicit foreign key constraints, you can maintain data integrity through application logic or by adding these constraints yourself if you are managing the database schema. These constraints would ensure that all references between tables are valid and consistent, preventing orphan records and maintaining referential integrity.
1
result = conn.execute(text("SELECT * FROM countries.cities"))
1
rows = result.fetchall()
1
[x for x in rows]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
[('Abidjan', 'CIV', 4765000.0, None, 4765000.0),
('Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0),
('Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0),
('Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0),
('Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0),
('Ahmedabad', 'IND', 5570585.0, None, 5570585.0),
('Alexandria', 'EGY', 4616625.0, None, 4616625.0),
('Algiers', 'DZA', 3415811.0, 5000000.0, 3415811.0),
('Almaty', 'KAZ', 1703481.0, None, 1703481.0),
('Ankara', 'TUR', 5271000.0, 4585000.0, 5271000.0),
('Auckland', 'NZL', 1495000.0, 1614300.0, 1495000.0),
('Baghdad', 'IRQ', 7180889.0, None, 7180889.0),
('Baku', 'AZE', 3202300.0, 4308740.0, 3202300.0),
('Bandung', 'IDN', 2575478.0, 6965655.0, 2575478.0),
('Bangkok', 'THA', 8280925.0, 14998000.0, 8280925.0),
('Barcelona', 'ESP', 1604555.0, 5375774.0, 1604555.0),
('Barranquilla', 'COL', 1386865.0, 2370753.0, 1386865.0),
('Basra', 'IRQ', 2750000.0, None, 2750000.0),
('Beijing', 'CHN', 21516000.0, 24900000.0, 21516000.0),
('Belo Horizonte', 'BRA', 2502557.0, 5156217.0, 2502557.0),
('Bengaluru', 'IND', 8425970.0, 9807000.0, 8425970.0),
('Berlin', 'DEU', 3517424.0, 5871022.0, 3517424.0),
('Bhopal', 'IND', 1798218.0, 1864389.0, 1798218.0),
('Birmingham', 'GBR', 1111300.0, 3683000.0, 1111300.0),
('Bogota', 'COL', 7878783.0, 9800000.0, 7878783.0),
('Brasilia', 'BRA', 2556149.0, 3919864.0, 2556149.0),
('Brazzaville', 'COG', 1827000.0, None, 1827000.0),
('Brisbane', 'AUS', 1180285.0, 2349699.0, 1180285.0),
('Bucharest', 'ROM', 1883425.0, 2272163.0, 1883425.0),
('Budapest', 'HUN', 1759407.0, 2927944.0, 1759407.0),
('Buenos Aires', 'ARG', 3054300.0, 14122000.0, 3054300.0),
('Busan', 'KOR', 3510833.0, 8202239.0, 3510833.0),
('Cairo', 'EGY', 10230350.0, 18290000.0, 10230350.0),
('Calgary', 'CAN', 1235171.0, 1214839.0, 1235171.0),
('Cali', 'COL', 2400653.0, 3400000.0, 2400653.0),
('Caloocan', 'PHL', 1583978.0, None, 1583978.0),
('Campinas', 'BRA', 1164098.0, 3094181.0, 1164098.0),
('Cape Town', 'ZAF', 3740026.0, None, 3740026.0),
('Caracas', 'VEN', 1943901.0, 2923959.0, 1943901.0),
('Casablanca', 'MAR', 5117832.0, 6861739.0, 5117832.0),
('Changchun', 'CHN', 3815270.0, 7674439.0, 3815270.0),
('Changsha', 'CHN', 7044118.0, None, 7044118.0),
('Chaozhou', 'CHN', 2669844.0, None, 2669844.0),
('Chengdu', 'CHN', 4741929.0, 10376000.0, 4741929.0),
('Chennai', 'IND', 7088000.0, None, 7088000.0),
('Chicago', 'USA', 2695598.0, 9156000.0, 2695598.0),
('Chittagong', 'BGD', 2581643.0, 4009423.0, 2581643.0),
('Chongqing', 'CHN', 8189800.0, 52100100.0, 8189800.0),
('Cologne', 'DEU', 1057327.0, 3573500.0, 1057327.0),
('Cordoba', 'ARG', 1330023.0, 1528000.0, 1330023.0),
('Curitiba', 'BRA', 1879355.0, 3400000.0, 1879355.0),
('Daegu', 'KOR', 2492994.0, None, 2492994.0),
('Daejeon', 'KOR', 1535028.0, None, 1535028.0),
('Dakar', 'SEN', 1146053.0, 2452656.0, 1146053.0),
('Dalian', 'CHN', 2146099.0, 5935638.0, 2146099.0),
('Dallas', 'USA', 1317929.0, 7233323.0, 1317929.0),
('Dar es Salaam', 'TZA', 4364541.0, None, 4364541.0),
('Davao City', 'PHL', 1632991.0, 2516216.0, 1632991.0),
('Delhi', 'IND', 16787940.0, 24998000.0, 16787940.0),
('Dhaka', 'BGD', 14543124.0, None, 14543124.0),
('Dongguan', 'CHN', 8220207.0, None, 8220207.0),
('Douala', 'CMR', 2446945.0, None, 2446945.0),
('Dubai', 'ARE', 2643410.0, None, 2643410.0),
('Durban', 'ZAF', 3442361.0, None, 3442361.0),
('Ekurhuleni', 'ZAF', 3178470.0, None, 3178470.0),
('Faisalabad', 'PAK', 6480765.0, 3675000.0, 6480765.0),
('Fez', 'MAR', 1112072.0, None, 1112072.0),
('Fortaleza', 'BRA', 2609716.0, 4019213.0, 2609716.0),
('Foshan', 'CHN', 6151622.0, None, 6151622.0),
('Fukuoka', 'JPN', 1483052.0, 5590378.0, 1483052.0),
('Fuzhou', 'CHN', 7115369.0, None, 7115369.0),
('Giza', 'EGY', 4239988.0, None, 4239988.0),
('Guadalajara', 'MEX', 1495189.0, 4424252.0, 1495189.0),
('Guangzhou', 'CHN', 14043500.0, 44259000.0, 14043500.0),
('Guatemala City', 'GTM', 2110100.0, 4500000.0, 2110100.0),
('Guayaquil', 'ECU', 3600000.0, 5000000.0, 3600000.0),
('Gujranwala', 'PAK', 2700003.0, None, 2700003.0),
('Hamburg', 'DEU', 1787408.0, None, 1787408.0),
('Hangzhou', 'CHN', 3560391.0, None, 3560391.0),
('Hanoi', 'VNM', 6844100.0, None, 6844100.0),
('Harare', 'ZWE', 1606000.0, None, 1606000.0),
('Harbin', 'CHN', 4280701.0, 10635971.0, 4280701.0),
('Havana', 'CUB', 2106146.0, None, 2106146.0),
('Hefei', 'CHN', 3352076.0, None, 3352076.0),
('Hiroshima', 'JPN', 1196274.0, None, 1196274.0),
('Ho Chi Minh City', 'VNM', 7681700.0, None, 7681700.0),
('Hong Kong', 'CHN', 7374900.0, None, 7374900.0),
('Houston', 'USA', 2489558.0, 6490180.0, 2489558.0),
('Hyderabad (India)', 'IND', 7859250.0, None, 7859250.0),
('Hyderabad', 'PAK', 3429471.0, None, 3429471.0),
('Ibadan', 'NGA', 1338659.0, 2837000.0, 1338659.0),
('Incheon', 'KOR', 2978367.0, None, 2978367.0),
('Isfahan', 'IRN', 2243249.0, None, 2243249.0),
('Islamabad', 'PAK', 1900000.0, 2200000.0, 1900000.0),
('Istanbul', 'TUR', 14025000.0, 13520000.0, 14025000.0),
('Izmir', 'TUR', 4168000.0, 3019000.0, 4168000.0),
('Jaipur', 'IND', 3073350.0, None, 3073350.0),
('Jakarta', 'IDN', 10075310.0, 30539000.0, 10075310.0),
('Jeddah', 'SAU', 3456259.0, None, 3456259.0),
('Jinan', 'CHN', 2009273.0, 5853196.0, 2009273.0),
('Johannesburg', 'ZAF', 4434827.0, None, 4434827.0),
('Kabul', 'AFG', 3414100.0, None, 3414100.0),
('Kampala', 'UGA', 1507080.0, None, 1507080.0),
('Kano', 'NGA', 2153225.0, 3395000.0, 2153225.0),
('Kanpur', 'IND', 2768057.0, 3152317.0, 2768057.0),
('Kaohsiung', 'TWN', 2778918.0, None, 2778918.0),
('Karachi', 'PAK', 27506000.0, 25400000.0, 27506000.0),
('Karaj', 'IRN', 1973470.0, None, 1973470.0),
('Kawasaki', 'JPN', 1496035.0, None, 1496035.0),
('Kharkov', 'UKR', 1439566.0, 1650000.0, 1439566.0),
('Khartoum', 'SDN', 3639598.0, 5274321.0, 3639598.0),
('Kiev', 'UKR', 2908703.0, None, 2908703.0),
('Kinshasa', 'COD', 10130000.0, 13265000.0, 10130000.0),
('Kobe', 'JPN', 1536499.0, None, 1536499.0),
('Kochi', 'IND', 2232456.0, 4221140.0, 2232456.0),
('Kolkata', 'IND', 4486679.0, 14667000.0, 4486679.0),
('Kuala Lumpur', 'MYS', 1768000.0, 7200000.0, 1768000.0),
('Kwangju', 'KOR', 1477780.0, None, 1477780.0),
('Kyoto', 'JPN', 1474570.0, None, 1474570.0),
('Lagos', 'NGA', 16060303.0, 21000000.0, 16060303.0),
('Lahore', 'PAK', 10355000.0, 13569000.0, 10355000.0),
('Lanzhou', 'CHN', 2177130.0, 3616163.0, 2177130.0),
('Lima', 'PER', 8852000.0, 10750000.0, 8852000.0),
('London', 'GBR', 8673713.0, 13879757.0, 8673713.0),
('Los Angeles', 'USA', 3884307.0, 15058000.0, 3884307.0),
('Luanda', 'AGO', 2825311.0, None, 2825311.0),
('Lucknow', 'IND', 2815601.0, None, 2815601.0),
('Lusaka', 'ZMB', 1742979.0, 2467467.0, 1742979.0),
('Madrid', 'ESP', 3207247.0, None, 3207247.0),
('Makassar', 'IDN', 1338633.0, 1976168.0, 1338633.0),
('Managua', 'NIC', 2560789.0, None, 2560789.0),
('Mandalay', 'MMR', 1319452.0, 1726889.0, 1319452.0),
('Manila', 'PHL', 1780148.0, 12877253.0, 1780148.0),
('Maputo', 'MOZ', 1766184.0, 1766823.0, 1766184.0),
('Maracaibo', 'VEN', 1599940.0, 3897655.0, 1599940.0),
('Mashhad', 'IRN', 3312090.0, 3372660.0, 3312090.0),
('Medan', 'IDN', 2097610.0, 4103696.0, 2097610.0),
('Medellin', 'COL', 2441123.0, 3731447.0, 2441123.0),
('Mexico City', 'MEX', 8974724.0, 20063000.0, 8974724.0),
('Milan', 'ITA', 1359905.0, 3206465.0, 1359905.0),
('Minsk', 'BLR', 1959781.0, None, 1959781.0),
('Monterrey', 'MEX', 1130960.0, 4520329.0, 1130960.0),
('Montevideo', 'URY', 1305082.0, 1947604.0, 1305082.0),
('Montreal', 'CAN', 1649519.0, 4127100.0, 1649519.0),
('Moscow', 'RUS', 12197596.0, 16170000.0, 12197596.0),
('Multan', 'PAK', 3117000.0, None, 3117000.0),
('Mumbai', 'IND', 12478447.0, 17712000.0, 12478447.0),
('Munich', 'DEU', 1450381.0, 2606021.0, 1450381.0),
('Nagoya', 'JPN', 2296014.0, 9107414.0, 2296014.0),
('Nagpur', 'IND', 2405665.0, 2497870.0, 2405665.0),
('Nairobi', 'KEN', 3138369.0, None, 3138369.0),
('Nanjing', 'CHN', 8230000.0, 34360000.0, 8230000.0),
('New Taipei City', 'TWN', 3954929.0, None, 3954929.0),
('New York City', 'USA', 8550405.0, 20182304.0, 8550405.0),
('Ningbo', 'CHN', 3491597.0, 7639000.0, 3491597.0),
('Nizhny Novgorod', 'RUS', 1250619.0, None, 1250619.0),
('Novosibirsk', 'RUS', 1567087.0, None, 1567087.0),
('Omsk', 'RUS', 1154116.0, None, 1154116.0),
('Oran', 'DZA', 1560329.0, 3454078.0, 1560329.0),
('Osaka', 'JPN', 2691742.0, 19341976.0, 2691742.0),
('Ouagadougou', 'BFA', 2200000.0, 2500000.0, 2200000.0),
('Palembang', 'IDN', 1708413.0, None, 1708413.0),
('Paris', 'FRA', 2229621.0, 10601122.0, 2229621.0),
('Patna', 'IND', 1683200.0, 2231554.0, 1683200.0),
('Peshawar', 'PAK', 3201000.0, None, 3201000.0),
('Philadelphia', 'USA', 1567872.0, 6069875.0, 1567872.0),
('Phnom Penh', 'KHM', 2234566.0, None, 2234566.0),
('Phoenix', 'USA', 1563025.0, 4574531.0, 1563025.0),
('Porto Alegre', 'BRA', 1476867.0, None, 1476867.0),
('Prague', 'CZE', 1324000.0, None, 1324000.0),
('Pune', 'IND', 3115431.0, None, 3115431.0),
('Pyongyang', 'PRK', 3255388.0, None, 3255388.0),
('Qingdao', 'CHN', 6188100.0, 9046200.0, 6188100.0),
('Quanzhou', 'CHN', 8128533.0, 6107475.0, 8128533.0),
('Quezon City', 'PHL', 2936116.0, None, 2936116.0),
('Quito', 'ECU', 2671191.0, 4700000.0, 2671191.0),
('Rawalpindi', 'PAK', 3198911.0, None, 3198911.0),
('Recife', 'BRA', 1555039.0, 3743854.0, 1555039.0),
('Rio de Janeiro', 'BRA', 6429923.0, 12727000.0, 6429923.0),
('Riyadh', 'SAU', 5676621.0, None, 5676621.0),
('Rome', 'ITA', 2877215.0, 4353775.0, 2877215.0),
('Rosario', 'ARG', 1193605.0, 1276000.0, 1193605.0),
('Rostov-on-Don', 'RUS', 1119900.0, None, 1119900.0),
('Saint Petersburg', 'RUS', 5191690.0, 5900000.0, 5191690.0),
('Saitama', 'JPN', 1226656.0, None, 1226656.0),
('Salvador', 'BRA', 2902927.0, 3919864.0, 2902927.0),
('San Antonio', 'USA', 1469845.0, 2454061.0, 1469845.0),
('San Diego', 'USA', 1394928.0, 3095313.0, 1394928.0),
("Sana'a", 'YEM', 1937451.0, 2167961.0, 1937451.0),
('Santa Cruz de la Sierra', 'BOL', 1453549.0, 1749000.0, 1453549.0),
('Santiago', 'CHL', 5743719.0, None, 5743719.0),
('Sao Paulo', 'BRA', 12038175.0, 21090792.0, 12038175.0),
('Sapporo', 'JPN', 1918096.0, 2584880.0, 1918096.0),
('Semarang', 'IDN', 1555984.0, 3183516.0, 1555984.0),
('Seoul', 'KOR', 9995784.0, 12700000.0, 9995784.0),
('Shanghai', 'CHN', 24256800.0, 34750000.0, 24256800.0),
('Shantou', 'CHN', 5391028.0, 11535677.0, 5391028.0),
('Shenyang', 'CHN', 8106171.0, None, 8106171.0),
('Shenzhen', 'CHN', 10778900.0, None, 10778900.0),
('Shijiazhuang', 'CHN', 4303700.0, 10701600.0, 4303700.0),
('Shiraz', 'IRN', 1869001.0, None, 1869001.0),
('Singapore', 'SGP', 5535000.0, None, 5535000.0),
('Surabaya', 'IDN', 2765487.0, 7302283.0, 2765487.0),
('Surat', 'IND', 4462002.0, None, 4462002.0),
('Suzhou', 'CHN', 10650501.0, None, 10650501.0),
("T'bilisi", 'GEO', 1118035.0, 1485293.0, 1118035.0),
('Tabriz', 'IRN', 1733033.0, None, 1733033.0),
('Taichung', 'TWN', 2752413.0, None, 2752413.0),
('Tainan', 'TWN', 1885252.0, None, 1885252.0),
('Taipei', 'TWN', 2704974.0, None, 2704974.0),
('Tangshan', 'CHN', 3187171.0, 7536521.0, 3187171.0),
('Tashkent', 'UZB', 2309600.0, None, 2309600.0),
('Tehran', 'IRN', 8154051.0, 13532000.0, 8154051.0),
('Tianjin', 'CHN', 6859779.0, 15469500.0, 6859779.0),
('Tijuana', 'MEX', 1696923.0, 1895797.0, 1696923.0),
('Tokyo', 'JPN', 13513734.0, 37843000.0, 13513734.0),
('Toronto', 'CAN', 2731571.0, 6417516.0, 2731571.0),
('Tripoli', 'LBY', 1126000.0, 2267000.0, 1126000.0),
('Tunis', 'TUN', 1056247.0, 2643695.0, 1056247.0),
('Ulsan', 'KOR', 1163690.0, None, 1163690.0),
('Vienna', 'AUT', 1863881.0, 2600000.0, 1863881.0),
('Vijayawada', 'IND', 1491202.0, None, 1491202.0),
('Visakhapatnam', 'IND', 2035922.0, 5340000.0, 2035922.0),
('Warsaw', 'POL', 1753977.0, 3100844.0, 1753977.0),
('Wenzhou', 'CHN', 3039439.0, None, 3039439.0),
('Wuhan', 'CHN', 6886253.0, None, 6886253.0),
("Xi'an", 'CHN', 8705600.0, 13569700.0, 8705600.0),
('Xiamen', 'CHN', 3531347.0, 5114758.0, 3531347.0),
('Yangon', 'MMR', 5214000.0, None, 5214000.0),
('Yaounde', 'CMR', 2440462.0, None, 2440462.0),
('Yekaterinburg', 'RUS', 1428042.0, None, 1428042.0),
('Yerevan', 'ARM', 1060138.0, None, 1060138.0),
('Yokohama', 'JPN', 3726167.0, None, 3726167.0),
('Zhengzhou', 'CHN', 4122087.0, None, 4122087.0),
('Zhongshan', 'CHN', 3121275.0, None, 3121275.0),
('Zunyi', 'CHN', 6127009.0, None, 6127009.0)]
1
2
3
4
cities = conn.execute(text("select * \
from countries.countries \
inner join countries.cities \
on countries.cities.country_code = countries.code"))
1
cities_res = cities.fetchall()
1
cities_list = [x for i, x in enumerate(cities_res) if i < 10]
1
cities_list
1
2
3
4
5
6
7
8
9
10
[('CIV', "Cote d'Ivoire", 'Africa', 'Western Africa', 322463.0, 1960, 'Cote d\x92Ivoire', 'Republic', 'Yamoussoukro', -4.0305, 5.332, 'Abidjan', 'CIV', 4765000.0, None, 4765000.0),
('ARE', 'United Arab Emirates', 'Asia', 'Middle East', 83600.0, 1971, 'Al-Imarat al-´Arabiya al-Muttahida', 'Emirate Federation', 'Abu Dhabi', 54.3705, 24.4764, 'Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0),
('NGA', 'Nigeria', 'Africa', 'Western Africa', 923768.0, 1960, 'Nigeria', 'Federal Republic', 'Abuja', 7.48906, 9.05804, 'Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0),
('GHA', 'Ghana', 'Africa', 'Western Africa', 238533.0, 1957, 'Ghana', 'Republic', 'Accra', -0.20795, 5.57045, 'Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0),
('ETH', 'Ethiopia', 'Africa', 'Eastern Africa', 1104300.0, -1000, 'YeItyop´iya', 'Republic', 'Addis Ababa', 38.7468, 9.02274, 'Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0),
('IND', 'India', 'Asia', 'Southern and Central Asia', 3287260.0, 1947, 'Bharat/India', 'Federal Republic', 'New Delhi', 77.225, 28.6353, 'Ahmedabad', 'IND', 5570585.0, None, 5570585.0),
('EGY', 'Egypt', 'Africa', 'Northern Africa', 1001450.0, 1922, 'Misr', 'Republic', 'Cairo', 31.2461, 30.0982, 'Alexandria', 'EGY', 4616625.0, None, 4616625.0),
('DZA', 'Algeria', 'Africa', 'Northern Africa', 2381740.0, 1962, 'Al-Jaza\x92ir/Algerie', 'Republic', 'Algiers', 3.05097, 36.7397, 'Algiers', 'DZA', 3415811.0, 5000000.0, 3415811.0),
('KAZ', 'Kazakhstan', 'Asia', 'Southern and Central Asia', 2724900.0, 1991, 'Qazaqstan', 'Republic', 'Astana', 71.4382, 51.1879, 'Almaty', 'KAZ', 1703481.0, None, 1703481.0),
('TUR', 'Turkey', 'Asia', 'Middle East', 774815.0, 1923, 'Turkiye', 'Republic', 'Ankara', 32.3606, 39.7153, 'Ankara', 'TUR', 5271000.0, 4585000.0, 5271000.0)]
Introduction to joins
In this chapter, you’ll be introduced to the concept of joining tables, and explore the different ways you can enrich your queries using inner joins and self-joins. You’ll also see how to use the case statement to split up a field into different categories.
Introduction to INNER JOIN
1
cities = conn.execute(text("select * from countries.cities"))
1
cities_df = pd.read_sql("select * from countries.cities", conn)
1
cities_df.head()
name | country_code | city_proper_pop | metroarea_pop | urbanarea_pop | |
---|---|---|---|---|---|
0 | Abidjan | CIV | 4765000.0 | NaN | 4765000.0 |
1 | Abu Dhabi | ARE | 1145000.0 | NaN | 1145000.0 |
2 | Abuja | NGA | 1235880.0 | 6000000.0 | 1235880.0 |
3 | Accra | GHA | 2070463.0 | 4010054.0 | 2070463.0 |
4 | Addis Ababa | ETH | 3103673.0 | 4567857.0 | 3103673.0 |
1
2
sql_stmt = "SELECT * FROM countries.cities INNER JOIN countries.countries ON countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
name | country_code | city_proper_pop | metroarea_pop | urbanarea_pop | code | name | continent | region | surface_area | indep_year | local_name | gov_form | capital | cap_long | cap_lat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abidjan | CIV | 4765000.0 | NaN | 4765000.0 | CIV | Cote d'Ivoire | Africa | Western Africa | 322463.0 | 1960 | Cote dIvoire | Republic | Yamoussoukro | -4.03050 | 5.33200 |
1 | Abu Dhabi | ARE | 1145000.0 | NaN | 1145000.0 | ARE | United Arab Emirates | Asia | Middle East | 83600.0 | 1971 | Al-Imarat al-´Arabiya al-Muttahida | Emirate Federation | Abu Dhabi | 54.37050 | 24.47640 |
2 | Abuja | NGA | 1235880.0 | 6000000.0 | 1235880.0 | NGA | Nigeria | Africa | Western Africa | 923768.0 | 1960 | Nigeria | Federal Republic | Abuja | 7.48906 | 9.05804 |
3 | Accra | GHA | 2070463.0 | 4010054.0 | 2070463.0 | GHA | Ghana | Africa | Western Africa | 238533.0 | 1957 | Ghana | Republic | Accra | -0.20795 | 5.57045 |
4 | Addis Ababa | ETH | 3103673.0 | 4567857.0 | 3103673.0 | ETH | Ethiopia | Africa | Eastern Africa | 1104300.0 | -1000 | YeItyop´iya | Republic | Addis Ababa | 38.74680 | 9.02274 |
1
2
3
4
sql_stmt = "SELECT countries.cities.name as city, countries.countries.name as country, \
countries.countries.region FROM countries.cities INNER JOIN countries.countries ON \
countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
city | country | region | |
---|---|---|---|
0 | Abidjan | Cote d'Ivoire | Western Africa |
1 | Abu Dhabi | United Arab Emirates | Middle East |
2 | Abuja | Nigeria | Western Africa |
3 | Accra | Ghana | Western Africa |
4 | Addis Ababa | Ethiopia | Eastern Africa |
INNER JOIN via USING
1
2
3
4
5
6
SELECT left_table.id as L_id
left_table.val as L_val
right_table.val as R_val
FROM left_table
INNER JOIN right_table
ON left_table.id = right_table.id;
- When the key field you’d like to join on is the same name in both tables, you can use a
USING
clause instead of theON
clause.
1
2
3
4
5
6
SELECT left_table.id as L_id
left_table.val as L_val
right_table.val as R_val
FROM left_table
INNER JOIN right_table
USING (id);
Countries with prime ministers and presidents
1
2
3
4
SELECT p1.country, p1.continent, prime_minister, president
FROM leaders.presidents AS p1
INNER JOIN leaders.prime_ministers as p2
USING (country);
1
2
3
4
5
6
sql_stmt = "SELECT p1.country, p1.continent, prime_minister, president \
FROM leaders.presidents AS p1 \
INNER JOIN leaders.prime_ministers as p2 \
USING (country)"
pd.read_sql(sql_stmt, conn).head()
country | continent | prime_minister | president | |
---|---|---|---|---|
0 | Egypt | Africa | Sherif Ismail | Abdel Fattah el-Sisi |
1 | Portugal | Europe | Antonio Costa | Marcelo Rebelo de Sousa |
2 | Haiti | North America | Jack Guy Lafontant | Jovenel Moise |
3 | Vietnam | Asia | Nguyen Xuan Phuc | Tran Dai Quang |
Exercises
Review inner join using on
Why does the following code result in an error?
1
2
3
SELECT c.name AS country, l.name AS language
FROM countries AS c
INNER JOIN languages AS l;
INNER JOIN
requires a specification of the key field (or fields) in each table.
Inner join with using
When joining tables with a common field name, e.g.
1
2
3
4
SELECT *
FROM countries
INNER JOIN economies
ON countries.code = economies.code
You can use USING
as a shortcut:
1
2
3
4
SELECT *
FROM countries
INNER JOIN economies
USING(code)
You’ll now explore how this can be done with the countries
and languages
tables.
Instructions
- Inner join
countries
on the left andlanguages
on the right withUSING(code)
. - Select the fields corresponding to:
- country name
AS country
, - continent name,
- language name
AS language
, and - whether or not the language is official.
- country name
- Remember to alias your tables using the first letter of their names.
1
2
3
4
5
6
7
8
-- 4. Select fields
SELECT c.name as country, c.continent, l.name as language, l.official
-- 1. From countries (alias as c)
FROM countries as c
-- 2. Join to languages (as l)
INNER JOIN languages as l
-- 3. Match using code
USING (code)
1
2
3
4
5
6
sql_stmt = "SELECT c.name as country, c.continent, l.name as language, l.official \
FROM countries.countries as c \
INNER JOIN countries.languages as l \
USING (code)"
pd.read_sql(sql_stmt, conn).head()
country | continent | language | official | |
---|---|---|---|---|
0 | Afghanistan | Asia | Dari | True |
1 | Afghanistan | Asia | Pashto | True |
2 | Afghanistan | Asia | Turkic | False |
3 | Afghanistan | Asia | Other | False |
4 | Albania | Europe | Albanian | True |
Self-ish joins, just in CASE
self-join on prime_ministers
1
2
3
4
5
sql_stmt = "SELECT * \
FROM leaders.prime_ministers"
pm_df = pd.read_sql(sql_stmt, conn)
pm_df.head()
country | continent | prime_minister | |
---|---|---|---|
0 | Egypt | Africa | Sherif Ismail |
1 | Portugal | Europe | Antonio Costa |
2 | Vietnam | Asia | Nguyen Xuan Phuc |
3 | Haiti | North America | Jack Guy Lafontant |
4 | India | Asia | Narendra Modi |
- inner joins where a table is joined with itself
- self join
- Explore how to slice a numerical field into categories using the CASE command
- Self-joins are used to compare values in a field to other values of the same field from within the same table
- Recall the prime ministers table:
- What if you wanted to create a new table showing countries that are in the same continenet matched as pairs?
1
2
3
4
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent;
1
2
3
4
5
6
7
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent"
pm_df_1 = pd.read_sql(sql_stmt, conn)
pm_df_1.head()
country1 | country2 | continent | |
---|---|---|---|
0 | Egypt | Egypt | Africa |
1 | Portugal | Spain | Europe |
2 | Portugal | Norway | Europe |
3 | Portugal | Portugal | Europe |
4 | Vietnam | Oman | Asia |
- The country column is selected twice as well as continent.
- The prime ministers table is on both the left and the right.
- The vital step is setting the key columns by which we match the table to itself.
- For each country, there will be a match if the country in the “right table” (that’s also prime_ministers) is in the same continent.
- This is a pairing of each country with every other country in its same continent
- Conditions where country1 = country2 should not be included in the table
Finishing off the self-join on prime_ministers
1
2
3
4
SELECT p1.country AS country1, p2.country AS country2, p1.continent
FROM leaders.prime_ministers as p1
INNER JOIN prime_ministers as p2
ON p1.continent = p2.continent AND p1.country <> p2.country;
1
2
3
4
5
6
7
sql_stmt = "SELECT p1.country AS country1, p2.country AS country2, p1.continent \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.prime_ministers as p2 \
ON p1.continent = p2.continent AND p1.country <> p2.country"
pm_df_2 = pd.read_sql(sql_stmt, conn)
pm_df_2.head()
country1 | country2 | continent | |
---|---|---|---|
0 | Portugal | Spain | Europe |
1 | Portugal | Norway | Europe |
2 | Vietnam | Oman | Asia |
3 | Vietnam | Brunei | Asia |
4 | Vietnam | India | Asia |
1
pm_df_1.equals(pm_df_2)
1
False
AND
clause can check that multiple conditions are met.- Now a match will not be made between prime_minister and itself if the countries match
CASE WHEN and THEN
- The states table contains numeric data about different countries in the six inhabited world continents
- Group the year of independence into categories of:
- before 1900
- between 1900 and 1930
- and after 1930
- CASE is a way to do multiple if-then-else statements
1
2
3
4
5
6
7
SELECT name, continent, indep_year,
CASE WHEN indep_year < 1900 THEN 'before 1900'
WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
ELSE 'after 1930' END
AS indep_year_group
FROM states
ORDER BY indep_year_group;
1
2
3
4
5
6
7
8
9
sql_stmt = "SELECT name, continent, indep_year, \
CASE WHEN indep_year < 1900 THEN 'before 1900' \
WHEN indep_year <= 1930 THEN 'between 1900 and 1930' \
ELSE 'after 1930' END \
AS indep_year_group \
FROM leaders.states \
ORDER BY indep_year_group"
pd.read_sql(sql_stmt, conn)
name | continent | indep_year | indep_year_group | |
---|---|---|---|---|
0 | Brunei | Asia | 1984 | after 1930 |
1 | India | Asia | 1947 | after 1930 |
2 | Oman | Asia | 1951 | after 1930 |
3 | Vietnam | Asia | 1945 | after 1930 |
4 | Liberia | Africa | 1847 | before 1900 |
5 | Chile | South America | 1810 | before 1900 |
6 | Haiti | North America | 1804 | before 1900 |
7 | Portugal | Europe | 1143 | before 1900 |
8 | Spain | Europe | 1492 | before 1900 |
9 | Uruguay | South America | 1828 | before 1900 |
10 | Norway | Europe | 1905 | between 1900 and 1930 |
11 | Australia | Oceania | 1901 | between 1900 and 1930 |
12 | Egypt | Africa | 1922 | between 1900 and 1930 |
Exercises
Self-join
In this exercise, you’ll use the populations
table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!
Since you’ll be joining the populations
table to itself, you can alias populations
as p1
and also populations
as p2
. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.
Instructions 1/3
- Join
populations
with itselfON country_code
. - Select the
country_code
fromp1
and thesize
field from bothp1
andp2
. SQL won’t allow same-named fields, so aliasp1.size as size2010
andp2.size as size2015
.
1
2
3
4
5
6
7
8
9
10
-- 4. Select fields with aliases
SELECT p1.size as size2010,
p1.country_code,
p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
-- 2. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 3. Match on country code
ON p1.country_code = p2.country_code
1
2
3
4
5
6
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code"
pd.read_sql(sql_stmt, conn).head()
size2010 | country_code | size2015 | |
---|---|---|---|
0 | 101597.0 | ABW | 103889.0 |
1 | 101597.0 | ABW | 101597.0 |
2 | 103889.0 | ABW | 103889.0 |
3 | 103889.0 | ABW | 101597.0 |
4 | 27962208.0 | AFG | 32526562.0 |
Instructions 2/3
Notice from the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.
- Extend the
ON
in your query to include only those records where thep1.year
(2010) matches withp2.year - 5
(2015 - 5 = 2010). This will omit the three entries percountry_code
that you aren’t interested in.
1
2
3
4
5
6
7
8
9
10
11
12
-- 4. Select fields with aliases
SELECT p1.country_code,
p1.size as size2010,
p2.size as size2015
-- 1. From populations (alias as p1)
FROM countries.populations as p1
-- 2. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 3. Match on country code
ON p1.country_code = p2.country_code
-- 4. and year (with calculation)
AND p1.year = (p2.year - 5)
1
2
3
4
5
6
7
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015 \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
size2010 | country_code | size2015 | |
---|---|---|---|
0 | 101597.0 | ABW | 103889.0 |
1 | 27962208.0 | AFG | 32526562.0 |
2 | 21219954.0 | AGO | 25021974.0 |
3 | 2913021.0 | ALB | 2889167.0 |
4 | 84419.0 | AND | 70473.0 |
Instructions 3/3
As you just saw, you can also use SQL to calculate values like p2.year - 5
for you. With two fields like size2010
and size2015
, you may want to determine the percentage increase from one field to the next:
With two numeric fields A
and B
, the percentage growth from A
to B
can be calculated as \(\frac{(B−A)}{A}∗100.0\).
Add a new field to SELECT
, aliased as growth_perc
, that calculates the percentage population growth from 2010 to 2015 for each country, using p2.size
and p1.size
.
1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT p1.country_code,
p1.size AS size2010,
p2.size AS size2015,
-- 1. calculate growth_perc
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc
-- 2. From populations (alias as p1)
FROM countries.populations as p1
-- 3. Join to itself (alias as p2)
INNER JOIN countries.populations as p2
-- 4. Match on country code
ON p1.country_code = p2.country_code
-- 5. and year (with calculation)
AND p1.year = (p2.year - 5);
1
2
3
4
5
6
7
8
sql_stmt = "SELECT p1.size as size2010, p1.country_code, p2.size as size2015, \
((p2.size - p1.size)/p1.size * 100.0) AS growth_perc \
FROM countries.populations as p1 \
INNER JOIN countries.populations as p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
size2010 | country_code | size2015 | growth_perc | |
---|---|---|---|---|
0 | 101597.0 | ABW | 103889.0 | 2.255972 |
1 | 27962208.0 | AFG | 32526562.0 | 16.323297 |
2 | 21219954.0 | AGO | 25021974.0 | 17.917192 |
3 | 2913021.0 | ALB | 2889167.0 | -0.818875 |
4 | 84419.0 | AND | 70473.0 | -16.519977 |
Case when and then
Often it’s useful to look at a numerical field not as raw data, but instead as being in different categories or groups.
You can use CASE
with WHEN
, THEN
, ELSE
, and END
to define a new grouping field.
Instructions
Using the countries table, create a new field AS geosize_group that groups the countries into three groups:
- If
surface_area
is greater than 2 million,geosize_group
is'large'
. - If
surface_area
is greater than 350 thousand but not larger than 2 million,geosize_group
is'medium'
. - Otherwise,
geosize_group
is'small'
.
1
2
3
4
5
6
7
8
9
10
11
SELECT name, continent, code, surface_area,
-- 1. First case
CASE WHEN surface_area > 2000000 THEN 'large'
-- 2. Second case
WHEN surface_area > 350000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name
AS geosize_group
-- 5. From table
FROM countries.countries;
1
2
3
4
5
6
7
8
sql_stmt = "SELECT name, continent, code, surface_area, \
CASE WHEN surface_area > 2000000 THEN 'large' \
WHEN surface_area > 350000 THEN 'medium' \
ELSE 'small' END \
AS geosize_group \
FROM countries.countries;"
pd.read_sql(sql_stmt, conn).head()
name | continent | code | surface_area | geosize_group | |
---|---|---|---|---|---|
0 | Afghanistan | Asia | AFG | 652090.0 | medium |
1 | Netherlands | Europe | NLD | 41526.0 | small |
2 | Albania | Europe | ALB | 28748.0 | small |
3 | Algeria | Africa | DZA | 2381740.0 | large |
4 | American Samoa | Oceania | ASM | 199.0 | small |
Inner challenge
The table you created with the added geosize_group
field has been loaded for you here with the name countries_plus
. Observe the use of (and the placement of) the INTO
command to create this countries_plus
table:
If you have downloaded the data from DataCamp and already have a schema for countries, countries_plus is already one of the tables
1
2
3
4
5
6
7
8
9
SELECT name, continent, code, surface_area,
CASE WHEN surface_area > 2000000
THEN 'large'
WHEN surface_area > 350000
THEN 'medium'
ELSE 'small' END
AS geosize_group
INTO countries_plus
FROM countries.countries;
You will now explore the relationship between the size of a country in terms of surface area and in terms of population using grouping fields created with CASE
.
By the end of this exercise, you’ll be writing two queries back-to-back in a single script. You got this!
Instructions 1/3
Using the populations
table focused only for the year
2015, create a new field AS popsize_group
to organize population size
into
'large'
(> 50 million),'medium'
(> 1 million), and'small'
groups.
Select only the country code, population size, and this new popsize_group
as fields.
1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT country_code, size,
-- 1. First case
CASE WHEN size > 50000000 THEN 'large'
-- 2. Second case
WHEN size > 1000000 THEN 'medium'
-- 3. Else clause + end
ELSE 'small' END
-- 4. Alias name
AS popsize_group
-- 5. From table
FROM countries.populations
-- 6. Focus on 2015
WHERE year = 2015;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT country_code, size, \
CASE WHEN size > 50000000 THEN 'large' \
WHEN size > 1000000 THEN 'medium' \
ELSE 'small' END \
AS popsize_group \
FROM countries.populations \
WHERE year = 2015; \
"
pd.read_sql(sql_stmt, conn).head()
country_code | size | popsize_group | |
---|---|---|---|
0 | ABW | 103889.0 | small |
1 | AFG | 32526562.0 | medium |
2 | AGO | 25021974.0 | medium |
3 | ALB | 2889167.0 | medium |
4 | AND | 70473.0 | small |
Instructions 2/3
- Use
INTO
to save the result of the previous query aspop_plus
. You can see an example of this in thecountries_plus
code in the assignment text. Make sure to include a;
at the end of yourWHERE
clause! - Then, include another query below your first query to display all the records in
pop_plus
usingSELECT * FROM pop_plus
; so that you generate results and this will displaypop_plus
in query result.
Execute the first part on the PostgreSQL schema to create pop_plus
1
2
3
4
5
6
7
8
9
SELECT country_code, size,
CASE WHEN size > 50000000 THEN 'large'
WHEN size > 1000000 THEN 'medium'
ELSE 'small' END
AS popsize_group
-- 1. Into table
INTO countries.pop_plus
FROM populations
WHERE year = 2015;
Run this below
1
2
-- 2. Select all columns of pop_plus
SELECT * FROM countries.pop_plus;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT country_code, size, \
CASE WHEN size > 50000000 THEN 'large' \
WHEN size > 1000000 THEN 'medium' \
ELSE 'small' END \
AS popsize_group \
INTO countries.pop_plus \
FROM countries.populations \
WHERE year = 2015;"
conn.execute(text(sql_stmt))
1
<sqlalchemy.engine.cursor.CursorResult at 0x27c993825f0>
1
2
3
4
5
sql_stmt = "\
SELECT * FROM countries.pop_plus; \
"
pd.read_sql(sql_stmt, conn).head()
country_code | size | popsize_group | |
---|---|---|---|
0 | ABW | 103889.0 | small |
1 | AFG | 32526562.0 | medium |
2 | AGO | 25021974.0 | medium |
3 | ALB | 2889167.0 | medium |
4 | AND | 70473.0 | small |
Instructions 3/3
- Keep the first query intact that creates
pop_plus
usingINTO
. - Write a query to join
countries_plus AS c
on the left withpop_plus AS p
on the right matching on the country code fields. - Sort the data based on
geosize_group
, in ascending order so thatlarge
appears on top. - Select the
name
,continent
,geosize_group
, andpopsize_group
fields.
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT c.name, c.continent, c.geosize_group, p.popsize_group \
FROM countries.countries_plus AS c \
INNER JOIN countries.pop_plus AS p \
ON c.code = p.country_code \
ORDER BY geosize_group ASC \
"
q_df = pd.read_sql(sql_stmt, conn)
q_df.head()
name | continent | geosize_group | popsize_group | |
---|---|---|---|---|
0 | India | Asia | large | large |
1 | United States | North America | large | large |
2 | Saudi Arabia | Asia | large | medium |
3 | China | Asia | large | large |
4 | Kazakhstan | Asia | large | medium |
1
q_df.tail()
name | continent | geosize_group | popsize_group | |
---|---|---|---|---|
201 | Guam | Oceania | small | small |
202 | Guyana | South America | small | small |
203 | Hong Kong | Asia | small | medium |
204 | Honduras | North America | small | medium |
205 | Croatia | Europe | small | medium |
Outer JOINs and Cross JOINs
In this chapter, you’ll come to grips with different kinds of outer joins. You’ll learn how to gain further insights into your data through left joins, right joins, and full joins. In addition to outer joins, you’ll also work with cross joins.
LEFT and RIGHT JOINs
- You can remember outer joins as reaching out to another table while keeping all of the records of the original table.
- Inner joins keep only the records in both tables.
- This chapter will explore three types of OUTER JOINs:
- LEFT JOINs
- RIGHT JOINs
- FULL JOINs
- How a LEFT JOIN differs from an INNER JOIN:
INNER JOIN
- The only records included in the resulting table of the INNER JOIN query were those in which the id field had matching values.
1
2
3
4
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
INNER JOIN presidents p2
ON p1.country = p2.country;
LEFT JOIN
- In contrast, a LEFT JOIN notes those record in the left table that do not have a match on the key field in the right table.
- This is denoted in the diagram by the open circles remaining close to the left table for id values of 2 and 3.
- Whereas the INNER JOIN kept just the records corresponding to id values 1 and 44, a LEFT JOIN keeps all of the original records in the left table, but then marks the values as missing in the right table for those that don’t have a match.
- The syntax of the LEFT JOIN is similar to that of the INNER JOIN.
1
2
3
4
SELECT p1.country, prime_minister, president
FROM prime_ministers as p1
LEFT JOIN presidents p2
ON p1.country = p2.country;
LEFT JOIN multiple matches
- It isn’t always the case that each key value in the left table corresponds to exactly one record in the key column of the right table.
- Duplicate rows are shown in the LEFT JOIN for id 1 since it has two matches corresponding to the values of R1 and R2 in the right2 table.
RIGHT JOIN
- Instead of matching entries in the id column on the left table to the id column on the right table, a RIGHT JOIN does the reverse.
- The resulting table from a RIGHT JOIN shows the missing entries in the L_val field.
- In the SQL statement the right table appears after RIGHT JOIN and the left table appears after FROM.
1
2
3
4
5
6
SELECT right_table.id AS R_id,
left_table.val AS L_val,
right_talbe.vale AS R_val
FROM left_table
RIGHT JOIN right_table
ON left_table.id = right_table.id;
INNER JOIN
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
INNER JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"
pd.read_sql(sql_stmt, conn)
country | prime_minister | president | |
---|---|---|---|
0 | Egypt | Sherif Ismail | Abdel Fattah el-Sisi |
1 | Portugal | Antonio Costa | Marcelo Rebelo de Sousa |
2 | Vietnam | Nguyen Xuan Phuc | Tran Dai Quang |
3 | Haiti | Jack Guy Lafontant | Jovenel Moise |
LEFT JOIN
- The first four records are the same as those from INNER JOIN
- The following records correspond to the countries that do not have a president and thus their president values are missing.
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT p1.country, prime_minister, president \
FROM leaders.prime_ministers as p1 \
LEFT JOIN leaders.presidents as p2 \
ON p1.country = p2.country \
"
pd.read_sql(sql_stmt, conn)
country | prime_minister | president | |
---|---|---|---|
0 | Egypt | Sherif Ismail | Abdel Fattah el-Sisi |
1 | Portugal | Antonio Costa | Marcelo Rebelo de Sousa |
2 | Vietnam | Nguyen Xuan Phuc | Tran Dai Quang |
3 | Haiti | Jack Guy Lafontant | Jovenel Moise |
4 | India | Narendra Modi | None |
5 | Australia | Malcolm Turnbull | None |
6 | Norway | Erna Solberg | None |
7 | Brunei | Hassanal Bolkiah | None |
8 | Oman | Qaboos bin Said al Said | None |
9 | Spain | Mariano Rajoy | None |
Exercises
LEFT JOIN
Now you’ll explore the differences between performing an inner join and a left join using the cities
and countries
tables.
You’ll begin by performing an inner join with the cities
table on the left and the countries
table on the right. Remember to alias the name of the city field as city
and the name of the country field as country
.
You will then change the query to a left join. Take note of how many records are in each query here!
Instructions 1/2
- Fill in the code based on the instructions in the code comments to complete the inner join. Note how many records are in the result of the join in the query result tab.
1
2
3
4
5
6
7
8
9
10
11
12
13
-- Select the city name (with alias), the country code,
-- the country name (with alias), the region,
-- and the city proper population
SELECT c1.name AS city, code, c2.name AS country,
region, city_proper_pop
-- From left table (with alias)
FROM cities AS c1
-- Join to right table (with alias)
INNER JOIN countries AS c2
-- Match on country code
ON c1.country_code = c2.code
-- Order by descending country code
ORDER BY code DESC;
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
INNER JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"
pd.read_sql(sql_stmt, conn).head()
city | code | country | region | city_proper_pop | |
---|---|---|---|---|---|
0 | Harare | ZWE | Zimbabwe | Eastern Africa | 1606000.0 |
1 | Lusaka | ZMB | Zambia | Eastern Africa | 1742979.0 |
2 | Cape Town | ZAF | South Africa | Southern Africa | 3740026.0 |
3 | Johannesburg | ZAF | South Africa | Southern Africa | 4434827.0 |
4 | Durban | ZAF | South Africa | Southern Africa | 3442361.0 |
Instructions 2/2
- Change the code to perform a
LEFT JOIN
instead of anINNER JOIN
. After executing this query, note how many records the query result contains.
1
2
3
4
5
6
7
8
9
SELECT c1.name AS city, code, c2.name AS country,
region, city_proper_pop
FROM cities AS c1
-- 1. Join right table (with alias)
LEFT JOIN countries AS c2
-- 2. Match on country code
ON c1.country_code = c2.code
-- 3. Order by descending country code
ORDER BY code DESC;
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT c1.name AS city, code, c2.name AS country, region, city_proper_pop \
FROM countries.cities AS c1 \
LEFT JOIN countries.countries AS c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC; \
"
pd.read_sql(sql_stmt, conn).head()
city | code | country | region | city_proper_pop | |
---|---|---|---|---|---|
0 | Taichung | None | None | None | 2752413.0 |
1 | Tainan | None | None | None | 1885252.0 |
2 | Kaohsiung | None | None | None | 2778918.0 |
3 | Bucharest | None | None | None | 1883425.0 |
4 | Taipei | None | None | None | 2704974.0 |
JEFT JOIN (2)
Next, you’ll try out another example comparing an inner join to its corresponding left join. Before you begin though, take note of how many records are in both the countries
and languages
tables below.
You will begin with an inner join on the countries
table on the left with the languages
table on the right. Then you’ll change the code to a left join in the next bullet.
Note the use of multi-line comments here using /*
and */
.
Instructions 1/2
- Perform an inner join. Alias the name of the
country
field as country and the name of thelanguage
field as language. - Sort based on desc
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
INNER JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
1
Number of Records: 914
country | local_name | language | percent | |
---|---|---|---|---|
0 | Zimbabwe | Zimbabwe | Shona | NaN |
1 | Zimbabwe | Zimbabwe | Tonga | NaN |
2 | Zimbabwe | Zimbabwe | Tswana | NaN |
3 | Zimbabwe | Zimbabwe | Venda | NaN |
4 | Zimbabwe | Zimbabwe | Xhosa | NaN |
Instructions 2/2
- Perform a left join instead of an inner join. Observe the result, and also note the change in the number of records in the result.
- Carefully review which records appear in the left join result, but not in the inner join result.
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT c.name AS country, local_name, l.name AS language, percent \
FROM countries.countries AS c \
LEFT JOIN countries.languages AS l \
ON c.code = l.code \
ORDER BY country desc; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
1
Number of Records: 921
country | local_name | language | percent | |
---|---|---|---|---|
0 | Zimbabwe | Zimbabwe | Chibarwe | NaN |
1 | Zimbabwe | Zimbabwe | Shona | NaN |
2 | Zimbabwe | Zimbabwe | Ndebele | NaN |
3 | Zimbabwe | Zimbabwe | English | NaN |
4 | Zimbabwe | Zimbabwe | Chewa | NaN |
LEFT JOIN (3)
Left join (3) You’ll now revisit the use of the AVG()
function introduced in our Intro to SQL for Data Science course. You will use it in combination with left join to determine the average gross domestic product (GDP) per capita by region in 2010.
Instructions 1/3
- Begin with a left join with the
countries
table on the left and theeconomies
table on the right. - Focus only on records with 2010 as the
year
.
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT name, region, gdp_percapita \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010; \
"
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
1
Number of Records: 185
name | region | gdp_percapita | |
---|---|---|---|
0 | Afghanistan | Southern and Central Asia | 539.667 |
1 | Angola | Central Africa | 3599.270 |
2 | Albania | Southern Europe | 4098.130 |
3 | United Arab Emirates | Middle East | 34628.630 |
4 | Argentina | South America | 10412.950 |
Instructions 2/3
- Modify your code to calculate the average GDP per capita
AS avg_gdp
for each region in 2010. - Select the
region
andavg_gdp
fields.
1
2
3
4
5
6
7
8
9
10
11
12
13
sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region \
ORDER BY avg_gdp DESC; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
1
Number of Records: 23
region | avg_gdp | |
---|---|---|
0 | Western Europe | 58130.961496 |
1 | Nordic Countries | 57073.997656 |
2 | North America | 47911.509766 |
3 | Australia and New Zealand | 44792.384766 |
4 | British Islands | 43588.330078 |
Instructions 3/3
- Arrange this data on average GDP per capita for each region in 2010 from highest to lowest average GDP per capita.
1
2
3
4
5
6
7
8
9
10
11
12
sql_stmt = "\
SELECT region, AVG(gdp_percapita) as avg_gdp \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON e.code = c.code \
WHERE year = 2010 \
GROUP BY region; \
"
res3 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res3)}')
res3.head()
1
Number of Records: 23
region | avg_gdp | |
---|---|---|
0 | Southern Africa | 5051.597974 |
1 | Caribbean | 11413.339454 |
2 | Eastern Africa | 1757.348162 |
3 | Southern Europe | 22926.410911 |
4 | Eastern Asia | 26205.851400 |
RIGHT JOIN
Right joins aren’t as common as left joins. One reason why is that you can always write a right join as a left join.
Instructions
The left join code is commented out here. Your task is to write a new query using rights joins that produces the same result as what the query using left joins produces. Keep this left joins code commented as you write your own query just below it using right joins to solve the problem.
- Note the order of the joins matters in your conversion to using right joins!
- convert this code to use RIGHT JOINs instead of LEFT JOINs
1
2
3
4
5
6
7
8
SELECT cities.name AS city, urbanarea_pop, countries.name AS country,
indep_year, languages.name AS language, percent
FROM cities
LEFT JOIN countries
ON cities.country_code = countries.code
LEFT JOIN languages
ON countries.code = languages.code
ORDER BY city, language;
1
2
3
4
5
6
7
8
9
10
11
12
sql_stmt = "\
SELECT cities.name AS city, urbanarea_pop, countries.name AS country, \
indep_year, languages.name AS language, percent \
FROM countries.languages \
RIGHT JOIN countries.countries \
ON languages.code = countries.code \
RIGHT JOIN countries.cities \
ON cities.country_code = countries.code \
ORDER BY city, language; \
"
pd.read_sql(sql_stmt, conn).head()
city | urbanarea_pop | country | indep_year | language | percent | |
---|---|---|---|---|---|---|
0 | Abidjan | 4765000.0 | Cote d'Ivoire | 1960.0 | French | NaN |
1 | Abidjan | 4765000.0 | Cote d'Ivoire | 1960.0 | Other | NaN |
2 | Abu Dhabi | 1145000.0 | United Arab Emirates | 1971.0 | Arabic | NaN |
3 | Abu Dhabi | 1145000.0 | United Arab Emirates | 1971.0 | English | NaN |
4 | Abu Dhabi | 1145000.0 | United Arab Emirates | 1971.0 | Hindi | NaN |
FULL JOINs
- The last of the three types of OUTER JOINs is the FULL JOIN
- Explore the difference between FULL JOIN and other JOINs
- The instruction will focus on comparing them to INNER JOINs and LEFT JOINs and then to LEFT JOINs and RIGHT JOINs.
- Let’s review how the diagram changes between and INNER JOIN and a LEFT JOIN for our basic example using the left and right tables.
- Then we’ll delve into the FULL JOIN diagram and is SQL code.
- Recall that an INNER JOIN keeps only the records that have matching key field values in both tables.
- A LEFT JOIN keeps all of the records in the left table while bringing in missing values for those key field values that don’t appear in the right table.
- Let’s review the differences between a LEFT JOIN and a RIGHT JOIN.
- The id values of 2 and 3 in the left table do not match with the id values in the right table, so missing values are brought in for them in the LEFT JOIN.
- Likewise for the RIGHT JOIN, missing values are brought in for id values of 5 and 6.
- A FULL JOIN combines a LEFT JOIN and RIGHT JOIN as you can see in the diagram.
- It will bring in all record from both the left and the right table and keep track of the missing values accordingly.
- Note the missing values here and all six of the values of id are included in the table.
- You can also see from the SQL code, to produce this FULL JOIN result, the general format aligns closely with the SQL syntax seen for an INNER JOIN and a LEFT JOIN.
1
2
3
4
5
6
7
SELECT left_table.id AS L_id,
right_table.id AS R_id,
left_table.val AS L_val,
right_table.val as R_val,
FULL left_table
FULL JOIN right_table
USING (id);
FULL JOIN example using leaders database
- Let’s revisit the example of looking at countries with prime ministers and / or presidents.
- Query breakdown:
- The SELECT statement includes the country field from both tables of interest and also the prime_minister and president fields.
- The left table is specified as prime_ministers with the alias of p1
- The order matters and if you switched the two tables, the output would be slightly different.
- The right table is specified as presidents with the alias of p2
- The join is done based on the key field of country in both tables
1
2
3
4
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president
FROM prime_ministers AS p1
FULL JOIN presidents AS p2
ON p1.country = p2.country;
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT p1.country AS pm_co, p2.country AS pres_co, prime_minister, president \
FROM leaders.prime_ministers AS p1 \
FULL JOIN leaders.presidents AS p2 \
ON p1.country = p2.country; \
"
pd.read_sql(sql_stmt, conn)
pm_co | pres_co | prime_minister | president | |
---|---|---|---|---|
0 | Egypt | Egypt | Sherif Ismail | Abdel Fattah el-Sisi |
1 | Portugal | Portugal | Antonio Costa | Marcelo Rebelo de Sousa |
2 | Vietnam | Vietnam | Nguyen Xuan Phuc | Tran Dai Quang |
3 | Haiti | Haiti | Jack Guy Lafontant | Jovenel Moise |
4 | India | None | Narendra Modi | None |
5 | Australia | None | Malcolm Turnbull | None |
6 | Norway | None | Erna Solberg | None |
7 | Brunei | None | Hassanal Bolkiah | None |
8 | Oman | None | Qaboos bin Said al Said | None |
9 | Spain | None | Mariano Rajoy | None |
10 | None | Uruguay | None | Jose Mujica |
11 | None | Chile | None | Michelle Bachelet |
12 | None | Liberia | None | Ellen Johnson Sirleaf |
Exercises
FULL JOIN
In this exercise, you’ll examine how your results differ when using a full join versus using a left join versus using an inner join with the countries
and currencies
tables.
You will focus on the North American region
and also where the name
of the country is missing. Dig in to see what we mean!
Begin with a full join with countries
on the left and currencies
on the right. The fields of interest have been SELECT
ed for you throughout this exercise.
Then complete a similar left join and conclude with an inner join.
Instructions 1/3
- Choose records in which region corresponds to North America or is NULL.
1
2
3
4
5
6
7
8
9
10
11
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
FULL JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
FULL JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
country | code | region | basic_unit | |
---|---|---|---|---|
0 | Canada | CAN | North America | Canadian dollar |
1 | United States | USA | North America | United States dollar |
2 | Bermuda | BMU | North America | Bermudian dollar |
3 | Greenland | GRL | North America | None |
4 | None | TMP | None | United States dollar |
5 | None | FLK | None | Falkland Islands pound |
6 | None | AIA | None | East Caribbean dollar |
7 | None | NIU | None | New Zealand dollar |
8 | None | ROM | None | Romanian leu |
9 | None | SHN | None | Saint Helena pound |
10 | None | SGS | None | British pound |
11 | None | TWN | None | New Taiwan dollar |
12 | None | WLF | None | CFP franc |
13 | None | MSR | None | East Caribbean dollar |
14 | None | IOT | None | United States dollar |
15 | None | CCK | None | Australian dollar |
16 | None | COK | None | New Zealand dollar |
Instructions 2/3
- Repeat the same query as above but use a
LEFT JOIN
instead of aFULL JOIN
. Note what has changed compared to theFULL JOIN
result!
1
2
3
4
5
6
7
8
9
10
11
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
LEFT JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
LEFT JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
country | code | region | basic_unit | |
---|---|---|---|---|
0 | Bermuda | BMU | North America | Bermudian dollar |
1 | Canada | CAN | North America | Canadian dollar |
2 | United States | USA | North America | United States dollar |
3 | Greenland | GRL | North America | None |
Instruction 3/3
- Repeat the same query as above but use an
INNER JOIN
instead of aFULL JOIN
. Note what has changed compared to theFULL JOIN
andLEFT JOIN
results!
1
2
3
4
5
6
7
8
9
10
11
SELECT name AS country, code, region, basic_unit
-- 3. From countries
FROM countries
-- 4. Join to currencies
INNER JOIN currencies
-- 5. Match on code
USING (code)
-- 1. Where region is North America or null
WHERE region = 'North America' OR region IS null
-- 2. Order by region
ORDER BY region;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name AS country, code, region, basic_unit \
FROM countries.countries \
INNER JOIN countries.currencies \
USING (code) \
WHERE region = 'North America' OR region IS null \
ORDER BY region; \
"
pd.read_sql(sql_stmt, conn)
country | code | region | basic_unit | |
---|---|---|---|---|
0 | Bermuda | BMU | North America | Bermudian dollar |
1 | Canada | CAN | North America | Canadian dollar |
2 | United States | USA | North America | United States dollar |
Have you kept an eye out on the different numbers of records these queries returned? The FULL JOIN
query returned 17 rows, the LEFT JOIN
returned 4 rows, and the INNER JOIN
only returned 3 rows. Do these results make sense to you?
FULL JOIN (2)
You’ll now investigate a similar exercise to the last one, but this time focused on using a table with more records on the left than the right. You’ll work with the languages
and countries
tables.
Begin with a full join with languages
on the left and countries
on the right. Appropriate fields have been selected for you again here.
Instructions 1/3
- Choose records in which
countries.name
starts with the capital letter'V'
or isNULL
and arrange bycountries.name
in ascending order to more clearly see the results.
1
2
3
4
5
6
7
8
9
10
11
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
FULL JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
FULL JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
name | code | language | |
---|---|---|---|
0 | Vanuatu | VUT | Tribal Languages |
1 | Vanuatu | VUT | English |
2 | Vanuatu | VUT | French |
3 | Vanuatu | VUT | Other |
4 | Vanuatu | VUT | Bislama |
5 | Venezuela | VEN | Spanish |
6 | Venezuela | VEN | indigenous |
7 | Vietnam | VNM | Vietnamese |
8 | Vietnam | VNM | English |
9 | Vietnam | VNM | Other |
10 | Virgin Islands, British | VGB | None |
11 | Virgin Islands, U.S. | VIR | None |
12 | None | NIU | English |
13 | None | NIU | Niuean |
14 | None | NIU | Other |
15 | None | NFK | English |
16 | None | NFK | Other |
17 | None | ROM | Romanian |
18 | None | ROM | Hungarian |
19 | None | ROM | Romani |
20 | None | ROM | Other |
21 | None | ROM | unspecified |
22 | None | SPM | French |
23 | None | TWN | Mandarin |
24 | None | TWN | Taiwanese |
25 | None | TWN | Hakka |
26 | None | TKL | Tokelauan |
27 | None | TKL | English |
28 | None | TKL | Samoan |
29 | None | TKL | Tuvaluan |
30 | None | TKL | Kiribati |
31 | None | TKL | Other |
32 | None | AIA | English |
33 | None | TKL | unspecified |
34 | None | WLF | Wallisian |
35 | None | WLF | Futunian |
36 | None | WLF | French |
37 | None | WLF | Other |
38 | None | ESH | Standard |
39 | None | ESH | Hassaniya |
40 | None | ESH | Moroccan |
41 | None | TKL | none |
42 | None | CXR | English |
43 | None | CXR | Chinese |
44 | None | CXR | Malay |
45 | None | CCK | Malay |
46 | None | CCK | English |
47 | None | COK | English |
48 | None | COK | Rarotongan |
49 | None | COK | Other |
50 | None | MSR | English |
51 | None | NIU | Niuean |
52 | None | NIU | Niuean |
Instructions 2/3
- Repeat the same query as above but use a
left join
instead of a full join. Note what has changed compared to the full join result!
1
2
3
4
5
6
7
8
9
10
11
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
LEFT JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
LEFT JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
name | code | language | |
---|---|---|---|
0 | Vanuatu | VUT | English |
1 | Vanuatu | VUT | Other |
2 | Vanuatu | VUT | French |
3 | Vanuatu | VUT | Tribal Languages |
4 | Vanuatu | VUT | Bislama |
5 | Venezuela | VEN | indigenous |
6 | Venezuela | VEN | Spanish |
7 | Vietnam | VNM | English |
8 | Vietnam | VNM | Vietnamese |
9 | Vietnam | VNM | Other |
10 | None | NIU | Niuean |
11 | None | NIU | Niuean |
12 | None | NIU | English |
13 | None | NIU | Niuean |
14 | None | NIU | Other |
15 | None | NFK | English |
16 | None | NFK | Other |
17 | None | ROM | Romanian |
18 | None | ROM | Hungarian |
19 | None | ROM | Romani |
20 | None | ROM | Other |
21 | None | ROM | unspecified |
22 | None | SPM | French |
23 | None | TWN | Mandarin |
24 | None | TWN | Taiwanese |
25 | None | TWN | Hakka |
26 | None | TKL | Tokelauan |
27 | None | TKL | English |
28 | None | TKL | Samoan |
29 | None | TKL | Tuvaluan |
30 | None | TKL | Kiribati |
31 | None | TKL | Other |
32 | None | TKL | none |
33 | None | TKL | unspecified |
34 | None | WLF | Wallisian |
35 | None | WLF | Futunian |
36 | None | WLF | French |
37 | None | WLF | Other |
38 | None | ESH | Standard |
39 | None | ESH | Hassaniya |
40 | None | AIA | English |
41 | None | ESH | Moroccan |
42 | None | CXR | English |
43 | None | CXR | Chinese |
44 | None | CXR | Malay |
45 | None | CCK | Malay |
46 | None | CCK | English |
47 | None | COK | English |
48 | None | COK | Rarotongan |
49 | None | COK | Other |
50 | None | MSR | English |
Instructions 3/3
- Repeat once more, but use an
inner join
instead of a left join. Note what has changed compared to the full join and left join results.
1
2
3
4
5
6
7
8
9
10
11
SELECT countries.name, code, languages.name AS language
-- 3. From languages
FROM languages
-- 4. Join to countries
INNER JOIN countries
-- 5. Match on code
USING (code)
-- 1. Where countries.name starts with V or is null
WHERE countries.name LIKE 'V%' OR countries.name IS null
-- 2. Order by ascending countries.name
ORDER BY countries.name;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT countries.name, code, languages.name AS language \
FROM countries.languages \
INNER JOIN countries.countries \
USING (code) \
WHERE countries.name LIKE 'V%%' OR countries.name IS null \
ORDER BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
name | code | language | |
---|---|---|---|
0 | Vanuatu | VUT | Tribal Languages |
1 | Vanuatu | VUT | Bislama |
2 | Vanuatu | VUT | English |
3 | Vanuatu | VUT | French |
4 | Vanuatu | VUT | Other |
5 | Venezuela | VEN | Spanish |
6 | Venezuela | VEN | indigenous |
7 | Vietnam | VNM | Vietnamese |
8 | Vietnam | VNM | English |
9 | Vietnam | VNM | Other |
FULL JOIN (3)
You’ll now explore using two consecutive full joins on the three tables you worked with in the previous two exercises.
Instructions
- Complete a full join with
countries
on the left andlanguages
on the right. - Next, full join this result with
currencies
on the right. - Use
LIKE
to choose the Melanesia and Micronesia regions (Hint:'M%esia'
). - Select the fields corresponding to the country name
AS country
, region, language nameAS language
, and basic and fractional units of currency.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-- 7. Select fields (with aliases)
SELECT c1.name AS country, region, l.name AS language,
basic_unit, frac_unit
-- 1. From countries (alias as c1)
FROM countries AS c1
-- 2. Join with languages (alias as l)
FULL JOIN languages AS l
-- 3. Match on code
USING (code)
-- 4. Join with currencies (alias as c2)
FULL JOIN currencies AS c2
-- 5. Match on code
USING (code)
-- 6. Where region like Melanesia and Micronesia
WHERE region LIKE 'M%esia';
1
2
3
4
5
6
7
8
9
10
11
12
sql_stmt = "\
SELECT c1.name AS country, region, l.name AS language, \
basic_unit, frac_unit \
FROM countries.countries AS c1 \
FULL JOIN countries.languages AS l \
USING (code) \
FULL JOIN countries.currencies AS c2 \
USING (code) \
WHERE region LIKE 'M%%esia'; \
"
pd.read_sql(sql_stmt, conn)
country | region | language | basic_unit | frac_unit | |
---|---|---|---|---|---|
0 | Kiribati | Micronesia | English | Australian dollar | Cent |
1 | Kiribati | Micronesia | Kiribati | Australian dollar | Cent |
2 | Marshall Islands | Micronesia | Other | United States dollar | Cent |
3 | Marshall Islands | Micronesia | Marshallese | United States dollar | Cent |
4 | Nauru | Micronesia | Other | Australian dollar | Cent |
5 | Nauru | Micronesia | English | Australian dollar | Cent |
6 | Nauru | Micronesia | Nauruan | Australian dollar | Cent |
7 | New Caledonia | Melanesia | Other | CFP franc | Centime |
8 | New Caledonia | Melanesia | French | CFP franc | Centime |
9 | Palau | Micronesia | Other | United States dollar | Cent |
10 | Palau | Micronesia | Chinese | United States dollar | Cent |
11 | Palau | Micronesia | Filipino | United States dollar | Cent |
12 | Palau | Micronesia | English | United States dollar | Cent |
13 | Palau | Micronesia | Other | United States dollar | Cent |
14 | Palau | Micronesia | Palauan | United States dollar | Cent |
15 | Papua New Guinea | Melanesia | Other | Papua New Guinean kina | Toea |
16 | Papua New Guinea | Melanesia | Hiri | Papua New Guinean kina | Toea |
17 | Papua New Guinea | Melanesia | English | Papua New Guinean kina | Toea |
18 | Papua New Guinea | Melanesia | Tok Pisin | Papua New Guinean kina | Toea |
19 | Solomon Islands | Melanesia | indigenous | Solomon Islands dollar | Cent |
20 | Solomon Islands | Melanesia | English | Solomon Islands dollar | Cent |
21 | Solomon Islands | Melanesia | Melanesian pidgin | Solomon Islands dollar | Cent |
22 | Vanuatu | Melanesia | Other | Vanuatu vatu | None |
23 | Vanuatu | Melanesia | French | Vanuatu vatu | None |
24 | Vanuatu | Melanesia | English | Vanuatu vatu | None |
25 | Vanuatu | Melanesia | Bislama | Vanuatu vatu | None |
26 | Vanuatu | Melanesia | Tribal Languages | Vanuatu vatu | None |
27 | Micronesia, Federated States of | Micronesia | Kapingamarangi | None | None |
28 | Micronesia, Federated States of | Micronesia | Nukuoro | None | None |
29 | Micronesia, Federated States of | Micronesia | Woleaian | None | None |
30 | Micronesia, Federated States of | Micronesia | Ulithian | None | None |
31 | Micronesia, Federated States of | Micronesia | Yapese | None | None |
32 | Micronesia, Federated States of | Micronesia | Pohnpeian | None | None |
33 | Micronesia, Federated States of | Micronesia | Kosrean | None | None |
34 | Micronesia, Federated States of | Micronesia | Chuukese | None | None |
35 | Micronesia, Federated States of | Micronesia | English | None | None |
36 | Fiji Islands | Melanesia | None | None | None |
37 | Northern Mariana Islands | Micronesia | Other | None | None |
38 | Northern Mariana Islands | Micronesia | Other Asian | None | None |
39 | Northern Mariana Islands | Micronesia | Chinese | None | None |
40 | Northern Mariana Islands | Micronesia | Other Pacific Island | None | None |
41 | Northern Mariana Islands | Micronesia | English | None | None |
42 | Northern Mariana Islands | Micronesia | Chamorro | None | None |
43 | Northern Mariana Islands | Micronesia | Philippine | None | None |
44 | Guam | Micronesia | Other | None | None |
45 | Guam | Micronesia | Asian | None | None |
46 | Guam | Micronesia | Other Pacific Islander | None | None |
47 | Guam | Micronesia | Chamorro | None | None |
48 | Guam | Micronesia | Filipino | None | None |
49 | Guam | Micronesia | English | None | None |
Review OUTER JOINs
A(n) ___
join is a join combining the results of a ___
join and a ___
join.
Answer the question
left, full, rightright, full, leftinner, left, right- **None of the above are true**
CROSSing the rubicon
- CROSS JOINs create all possible combinations of two tables.
- The resulting table is comprised of all 9 combinations if
id
fromtable1
andid
fromtable2
(e.g. 1(A-C), 2(A-C), & 3(A-C))
CROSS JOIN example: Pairing prime ministers with presidents
- Suppose all prime ministers in North America and Oceania in the
prime_ministers
table are scheduled for individual meetings with all presidents in the presidents table. - All the combinations can be created with a CROSS JOIN
1
2
3
4
SELECT prime_minister, president
FROM prime_ministers AS p1
CROSS JOIN presidents AS p2
WHERE p1.continent IN ('North America', 'Oceania');
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT prime_minister, president \
FROM leaders.prime_ministers AS p1 \
CROSS JOIN leaders.presidents AS p2 \
WHERE p1.continent IN ('North America', 'Oceania'); \
"
pd.read_sql(sql_stmt, conn)
prime_minister | president | |
---|---|---|
0 | Jack Guy Lafontant | Abdel Fattah el-Sisi |
1 | Malcolm Turnbull | Abdel Fattah el-Sisi |
2 | Jack Guy Lafontant | Marcelo Rebelo de Sousa |
3 | Malcolm Turnbull | Marcelo Rebelo de Sousa |
4 | Jack Guy Lafontant | Jovenel Moise |
5 | Malcolm Turnbull | Jovenel Moise |
6 | Jack Guy Lafontant | Jose Mujica |
7 | Malcolm Turnbull | Jose Mujica |
8 | Jack Guy Lafontant | Ellen Johnson Sirleaf |
9 | Malcolm Turnbull | Ellen Johnson Sirleaf |
10 | Jack Guy Lafontant | Michelle Bachelet |
11 | Malcolm Turnbull | Michelle Bachelet |
12 | Jack Guy Lafontant | Tran Dai Quang |
13 | Malcolm Turnbull | Tran Dai Quang |
Exercises
A table of two cities
This exercise looks to explore languages potentially and most frequently spoken in the cities of Hyderabad, India and Hyderabad, Pakistan.
You will begin with a cross join with cities AS c
on the left and languages AS l
on the right. Then you will modify the query using an inner join in the next tab.
Instructions 1/2
- Create the cross join as described above. (Recall that cross joins do not use
ON
orUSING
.) - Make use of
LIKE
andHyder%
to choose Hyderabad in both countries. - Select only the city name
AS city
and language nameAS language
.
1
2
3
4
5
6
7
8
-- 4. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities AS c
-- 2. Join to languages (alias as l)
CROSS JOIN languages AS l
-- 3. Where c.name like Hyderabad
WHERE c.name LIKE 'Hyder%';
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
CROSS JOIN countries.languages AS l \
WHERE c.name LIKE 'Hyder%%'; \
"
hyderabad_lang = pd.read_sql(sql_stmt, conn)
hyderabad_lang
city | language | |
---|---|---|
0 | Hyderabad (India) | Dari |
1 | Hyderabad | Dari |
2 | Hyderabad (India) | Pashto |
3 | Hyderabad | Pashto |
4 | Hyderabad (India) | Turkic |
... | ... | ... |
1905 | Hyderabad | Tswana |
1906 | Hyderabad (India) | Venda |
1907 | Hyderabad | Venda |
1908 | Hyderabad (India) | Xhosa |
1909 | Hyderabad | Xhosa |
1910 rows × 2 columns
1
2
unique_lang = hyderabad_lang['language'].unique()
print(len(unique_lang))
1
396
Instructions 2/2
- Use an inner join instead of a cross join. Think about what the difference will be in the results for this inner join result and the one for the cross join.
1
2
3
4
5
6
7
8
9
10
-- 5. Select fields
SELECT c.name AS city, l.name AS language
-- 1. From cities (alias as c)
FROM cities as c
-- 2. Join to languages (alias as l)
INNER JOIN languages AS l
-- 3. Match on country code
ON c.country_code = l.code
-- 4. Where c.name like Hyderabad
WHERE c.name like 'Hyder%';
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT c.name AS city, l.name AS language \
FROM countries.cities AS c \
INNER JOIN countries.languages AS l \
ON c.country_code = l.code \
WHERE c.name LIKE 'Hyder%%'; \
"
pd.read_sql(sql_stmt, conn)
city | language | |
---|---|---|
0 | Hyderabad (India) | Hindi |
1 | Hyderabad (India) | Bengali |
2 | Hyderabad (India) | Telugu |
3 | Hyderabad (India) | Marathi |
4 | Hyderabad (India) | Tamil |
5 | Hyderabad (India) | Urdu |
6 | Hyderabad (India) | Gujarati |
7 | Hyderabad (India) | Kannada |
8 | Hyderabad (India) | Malayalam |
9 | Hyderabad (India) | Oriya |
10 | Hyderabad (India) | Punjabi |
11 | Hyderabad (India) | Assamese |
12 | Hyderabad (India) | Maithili |
13 | Hyderabad (India) | Other |
14 | Hyderabad | Punjabi |
15 | Hyderabad | Sindhi |
16 | Hyderabad | Saraiki |
17 | Hyderabad | Pashto |
18 | Hyderabad | Urdu |
19 | Hyderabad | Balochi |
20 | Hyderabad | Hindko |
21 | Hyderabad | Brahui |
22 | Hyderabad | English |
23 | Hyderabad | Burushaski |
24 | Hyderabad | Other |
Outer challenge
Now that you’re fully equipped to use outer joins, try a challenge problem to test your knowledge!
In terms of life expectancy for 2010, determine the names of the lowest five countries and their regions.
Instructions
- Select country name
AS country
,region
, and life expectancyAS life_exp
. - Make sure to use
LEFT JOIN
,WHERE
,ORDER BY
, andLIMIT
.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Select fields
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp
-- From countries (alias as c)
FROM countries AS c
-- Join to populations (alias as p)
LEFT JOIN populations AS p
-- Match on country code
ON c.code = p.country_code
-- Focus on 2010
WHERE p.year = 2010
-- Order by life_exp
ORDER BY life_exp
-- Limit to 5 records
LIMIT 5;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT c.name AS country, c.region, p.life_expectancy AS life_exp \
FROM countries.countries AS c \
LEFT JOIN countries.populations AS p \
ON c.code = p.country_code \
WHERE p.year = 2010 \
ORDER BY life_exp \
LIMIT 5; \
"
pd.read_sql(sql_stmt, conn)
country | region | life_exp | |
---|---|---|---|
0 | Lesotho | Southern Africa | 47.483414 |
1 | Central African Republic | Central Africa | 47.625317 |
2 | Sierra Leone | Western Africa | 48.228950 |
3 | Swaziland | Southern Africa | 48.345757 |
4 | Zimbabwe | Eastern Africa | 49.574657 |
Set theory clauses
In this chapter, you’ll learn more about set theory using Venn diagrams and you will be introduced to union, union all, intersect, and except clauses. You’ll finish by investigating semi-joins and anti-joins, which provide a nice introduction to subqueries.
State of the UNION
- Focus on the operations UNION and UNION ALL.
- In addition to joining diagrams, you’ll see how Venn diagrams can be used to represent set operations.
- Think of each circle as representing a table of data
- The shading represents what’s included in the result of the set operation from each table.
UNION
includes every record in both tables, but DOES NOT double count those that are in both tables.UNION ALL
includes every record in both tables and DOES replicate those that are in both tables, represented by the black center- The two diagrams on the bottom represent only the subsets of data being selected.
INTERSECT
results in only those records found in both of the tables.EXCEPT
results in only those records in one table, BUT NOT the other.- Given two tables with name
left_one
andright_one
,one
corresponds to each table having one field. - If you run a
UNION
on these two fields, the result is each record appearing in either table, but notice the id values of1
and4
inright_one
, are not included again in theUNION
since they were already found in theleft_one
table. UNION ALL
includes all duplicates in its result, resulting in 8 total records for the example.
UNION & UNION ALL example
monarchs
table in theleaders
database- Use
UNION
on theprime_ministers
andmonarchs
tables - all prime ministers and monarchs
1
2
3
4
5
6
SELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;
- Note that the
prime_minister
field has been aliased as leader. The resulting field from theUNION
will have the name leader. - This is an important property of set theory clauses
- The fields included in the operation must be of the same data type since they are returned as a single field.
- A number field can’t be stacked on top of a character field.
- Spain and Norway have a prime minister and a monarch, while Brunei and Oman have a monarch who also acts as a prime minister.
1
2
3
4
5
6
SELECT prime_minister AS leader, country
FROM leaders.prime_ministers
UNION ALL
SELECT monarch, country
FROM leaders.monarchs
ORDER BY country;
UNION
andUNION ALL
clauses do not do the lookup step thatJOIN
s do, they stack records on top of each other from one table to the next.
1
2
3
4
5
6
sql_stmt = "\
SELECT * \
FROM leaders.monarchs; \
"
pd.read_sql(sql_stmt, conn)
country | continent | monarch | |
---|---|---|---|
0 | Brunei | Asia | Hassanal Bolkiah |
1 | Oman | Asia | Qaboos bin Said al Said |
2 | Norway | Europe | Harald V |
3 | Spain | Europe | Felipe VI |
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"
pd.read_sql(sql_stmt, conn)
leader | country | |
---|---|---|
0 | Malcolm Turnbull | Australia |
1 | Hassanal Bolkiah | Brunei |
2 | Sherif Ismail | Egypt |
3 | Jack Guy Lafontant | Haiti |
4 | Narendra Modi | India |
5 | Erna Solberg | Norway |
6 | Harald V | Norway |
7 | Qaboos bin Said al Said | Oman |
8 | Antonio Costa | Portugal |
9 | Mariano Rajoy | Spain |
10 | Felipe VI | Spain |
11 | Nguyen Xuan Phuc | Vietnam |
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT prime_minister AS leader, country \
FROM leaders.prime_ministers \
UNION ALL \
SELECT monarch, country \
FROM leaders.monarchs \
ORDER BY country; \
"
pd.read_sql(sql_stmt, conn)
leader | country | |
---|---|---|
0 | Malcolm Turnbull | Australia |
1 | Hassanal Bolkiah | Brunei |
2 | Hassanal Bolkiah | Brunei |
3 | Sherif Ismail | Egypt |
4 | Jack Guy Lafontant | Haiti |
5 | Narendra Modi | India |
6 | Erna Solberg | Norway |
7 | Harald V | Norway |
8 | Qaboos bin Said al Said | Oman |
9 | Qaboos bin Said al Said | Oman |
10 | Antonio Costa | Portugal |
11 | Felipe VI | Spain |
12 | Mariano Rajoy | Spain |
13 | Nguyen Xuan Phuc | Vietnam |
Exercises
UNION
Near query result to the right, you will see two new tables with names economies2010
and economies2015
.
Instructions
- Combine these two tables into one table containing all of the fields in
economies2010
. Theeconomies
table is also included for reference. - Sort this resulting single table by country code and then by year, both in ascending order.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields from 2010 table
SELECT *
-- From 2010 table
FROM countries.economies2010
-- Set theory clause
UNION
-- Select fields from 2015 table
SELECT *
-- From 2015 table
FROM countries.economies2015
-- Order by code and year
ORDER BY code, year;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT * \
FROM countries.economies2010 \
UNION \
SELECT * \
FROM countries.economies2015 \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
code | year | income_group | gross_savings | |
---|---|---|---|---|
0 | AFG | 2010 | Low income | 37.133 |
1 | AFG | 2015 | Low income | 21.466 |
2 | AGO | 2010 | Upper middle income | 23.534 |
3 | AGO | 2015 | Upper middle income | -0.425 |
4 | ALB | 2010 | Upper middle income | 20.011 |
... | ... | ... | ... | ... |
375 | ZAF | 2015 | Upper middle income | 16.460 |
376 | ZMB | 2010 | Lower middle income | 37.404 |
377 | ZMB | 2015 | Lower middle income | 39.177 |
378 | ZWE | 2010 | Low income | 16.109 |
379 | ZWE | 2015 | Low income | 5.563 |
380 rows × 4 columns
UNION (2)
UNION
can also be used to determine all occurrences of a field across multiple tables. Try out this exercise with no starter code.
Instructions
- Determine all (non-duplicated) country codes in either the
cities
or thecurrencies
table. The result should be a table with only one field calledcountry_code
. - Sort by
country_code
in alphabetical order.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select field
SELECT country_code
-- From cities
FROM countries.cities
-- Set theory clause
UNION
-- Select field
SELECT code
-- From currencies
FROM countries.currencies
-- Order by country_code
ORDER BY country_code;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT country_code \
FROM countries.cities \
UNION \
SELECT code \
FROM countries.currencies \
ORDER BY country_code; \
"
country_codes = pd.read_sql(sql_stmt, conn)
country_codes.head()
country_code | |
---|---|
0 | ABW |
1 | AFG |
2 | AGO |
3 | AIA |
4 | ALB |
1
country_codes.tail()
country_code | |
---|---|
200 | WSM |
201 | YEM |
202 | ZAF |
203 | ZMB |
204 | ZWE |
UNION ALL
As you saw, duplicates were removed from the previous two exercises by using UNION
.
To include duplicates, you can use UNION ALL
.
Instructions
- Determine all combinations (include duplicates) of country code and year that exist in either the
economies
or thepopulations
tables. Order bycode
thenyear
. - The result of the query should only have two columns/fields. Think about how many records this query should result in.
- You’ll use code very similar to this in your next exercise after the video. Make note of this code after completing it.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields
SELECT code, year
-- From economies
FROM countries.economies
-- Set theory clause
UNION ALL
-- Select fields
SELECT country_code, year
-- From populations
FROM countries.populations
-- Order by code, year
ORDER BY code, year;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT code, year \
FROM countries.economies \
UNION ALL \
SELECT country_code, year \
FROM countries.populations \
ORDER BY code, year; \
"
country_codes_year = pd.read_sql(sql_stmt, conn)
country_codes_year.head()
code | year | |
---|---|---|
0 | ABW | 2010 |
1 | ABW | 2015 |
2 | AFG | 2010 |
3 | AFG | 2010 |
4 | AFG | 2015 |
1
country_codes_year.tail()
code | year | |
---|---|---|
809 | ZMB | 2015 |
810 | ZWE | 2010 |
811 | ZWE | 2010 |
812 | ZWE | 2015 |
813 | ZWE | 2015 |
INTERSECTional data science
- The set theory clause
INTERSECT
works in a similar fashion toUNION
andUNION ALL
, but remember from the Venn diagram,INTERSECT
only includes those records in common to both tables and fields selected.
1
2
3
4
5
SELECT id
FROM left_one
INTERSECT
SELECT id
FROM right_one;
- The result only includes records common to the tables selected
- Determine countries with both a prime minister and president
- The code for each of these set operations has a similar layout.
- First select which fields to include from the first table, and then specify the name of the first table.
- Specify the set operation to perform
- Lastly, denote which fields to include from the second table, and then the name of the second table.
1
2
3
4
5
SELECT country
FROM leaders.prime_ministers
INTERSECT
SELECT country
FROM leaders.presidents
- What happens if two columns are selected, instead of one?
1
2
3
4
5
SELECT country, prime_minister as leader
FROM leaders.prime_ministers
INTERSECT
SELECT country, president
FROM leaders.presidents
- Will this also give you the names of the countries with both type of leaders?
- This results in an empty table.
- When
INTERSECT
looks at two columns, it includes both columns in the search. - It didn’t find any countries with prime ministers AND presidents having the same name.
INTERSECT
looks for records in common, not individual key fields like what a join does to match.
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT country \
FROM leaders.prime_ministers \
INTERSECT \
SELECT country \
FROM leaders.presidents \
"
pd.read_sql(sql_stmt, conn)
country | |
---|---|
0 | Portugal |
1 | Egypt |
2 | Haiti |
3 | Vietnam |
Exercises
INTERSECT
Repeat the previous UNION ALL
exercise, this time looking at the records in common for country code and year for the economies
and populations
tables.
Instructions
- Again, order by
code
and then byyear
, both in ascending order. - Note the number of records here (given at the bottom of query result) compared to the similar
UNION ALL
query result (814 records).
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields
SELECT code, year
-- From economies
FROM countries.economies
-- Set theory clause
INTERSECT
-- Select fields
SELECT country_code, year
-- From populations
FROM countries.populations
-- Order by code and year
ORDER BY code, year;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT code, year \
FROM countries.economies \
INTERSECT \
SELECT country_code, year \
FROM countries.populations \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
code | year | |
---|---|---|
0 | AFG | 2010 |
1 | AFG | 2015 |
2 | AGO | 2010 |
3 | AGO | 2015 |
4 | ALB | 2010 |
... | ... | ... |
375 | ZAF | 2015 |
376 | ZMB | 2010 |
377 | ZMB | 2015 |
378 | ZWE | 2010 |
379 | ZWE | 2015 |
380 rows × 2 columns
INTERSECT (2)
As you think about major world cities and their corresponding country, you may ask which countries also have a city with the same name as their country name?
Instructions
- Use
INTERSECT
to answer this question withcountries
andcities
!
1
2
3
4
5
6
7
8
9
10
-- Select fields
SELECT name
-- From countries
FROM countries.countries
-- Set theory clause
INTERSECT
-- Select fields
SELECT name
-- From cities
FROM countries.cities;
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT name \
FROM countries.countries \
INTERSECT \
SELECT name \
FROM countries.cities; \
"
pd.read_sql(sql_stmt, conn)
name | |
---|---|
0 | Singapore |
1 | Hong Kong |
Hong Kong is part of China, but it appears separately here because it has its own ISO country code. Depending upon your analysis, treating Hong Kong separately could be useful or a mistake. Always check your dataset closely before you perform an analysis!
Review UNION and INTERSECT
Which of the following combinations of terms and definitions is correct?
Answer the question
UNION: returns all records (potentially duplicates) in both tablesUNION ALL: returns only unique records- **INTERSECT: returns only records appearing in both tables**
None of the above are matched correctly
EXCEPTional
EXCEPT
includes only the records in one table, but not in the other.- There are some monarchs that also act as the prime minister. One way to determine those monarchs in the monarchs table that do not also hold the title prime minister, is to use the
EXCEPT
clause.
1
2
3
4
5
SELECT monarch, country
FROM leaders.monarchs
EXCEPT
SELECT prime_minister, country
FROM leaders.prime_ministers;
- This SQL query selects the monarch field from monarchs, then looks for common entries with the prime_ministers field, while also keeping track of the country for each leader.
- Only the two European monarchs are not also prime ministers in the leaders database.
- Only the records that appear in the left table, BUT DO NOT appear in the right table are included.
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT monarch, country \
FROM leaders.monarchs \
EXCEPT \
SELECT prime_minister, country \
FROM leaders.prime_ministers; \
"
pd.read_sql(sql_stmt, conn)
monarch | country | |
---|---|---|
0 | Harald V | Norway |
1 | Felipe VI | Spain |
Exercises
EXCEPT
Get the names of cities in cities
which are not noted as capital cities in countries
as a single field result.
Note that there are some countries in the world that are not included in the countries
table, which will result in some cities not being labeled as capital cities when in fact they are.
Instructions
- Order the resulting field in ascending order.
- Can you spot the city/cities that are actually capital cities which this query misses?
1
2
3
4
5
6
7
8
9
10
11
12
-- Select field
SELECT name
-- From cities
FROM countries.cities
-- Set theory clause
EXCEPT
-- Select field
SELECT capital
-- From countries
FROM countries.countries
-- Order by result
ORDER BY name;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name \
FROM countries.cities \
EXCEPT \
SELECT capital \
FROM countries.countries \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn).head()
name | |
---|---|
0 | Abidjan |
1 | Ahmedabad |
2 | Alexandria |
3 | Almaty |
4 | Auckland |
EXCEPT (2)
Now you will complete the previous query in reverse!
Determine the names of capital cities that are not listed in the cities
table.
Instructions
- Order by
capital
in ascending order. - The
cities
table contains information about 236 of the world’s most populous cities. The result of your query may surprise you in terms of the number of capital cities that DO NOT appear in this list!
1
2
3
4
5
6
7
8
9
10
11
12
-- Select field
SELECT capital
-- From countries
FROM countries.countries
-- Set theory clause
EXCEPT
-- Select field
SELECT name
-- From cities
FROM countries.cities
-- Order by ascending capital
ORDER BY capital;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT capital \
FROM countries.countries \
EXCEPT \
SELECT name \
FROM countries.cities \
ORDER BY capital; \
"
pd.read_sql(sql_stmt, conn).head()
capital | |
---|---|
0 | Agana |
1 | Amman |
2 | Amsterdam |
3 | Andorra la Vella |
4 | Antananarivo |
Semi-JOINs and Anti-JOINS
- The previous six joins are all additive joins, in that they add columns to the original left table.
- INNER JOIN
- SELF JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL JOIN
- CROSS JOIN
- The last two joins use a right table to determine which records to keep in the left table.
- Use these last to joins in a way similar to a WHERE clause dependent on the values of a second table.
semi-joins
andanti-joins
don’t have the same built-in SQL syntax that INNER JOIN and LEFT JOIN have.semi-joins
andanti-joins
are useful tools in filtering table records on the records of another table.- The challenge will be to combine set theory clauses with semi-joins.
- Determine the countries that gained independence before 1800.
1
2
3
SELECT name
FROM leaders.states
WHERE indep_year < 1800;
1
2
3
country
0 Portugal
1 Spain
- Determine president, country and continent
1
2
SELECT president, country, continent
FROM leaders.presidents
1
2
3
4
5
6
7
8
president country continent
0 Abdel Fattah el-Sisi Egypt Africa
1 Marcelo Rebelo de Sousa Portugal Europe
2 Jovenel Moise Haiti North America
3 Jose Mujica Uruguay South America
4 Ellen Johnson Sirleaf Liberia Africa
5 Michelle Bachelet Chile South America
6 Tran Dai Quang Vietnam Asia
SEMI JOIN
- Determine the presidents of countries that gained independence before 1800.
1
2
3
4
5
6
SELECT president, country, continent
FROM leaders.presidents
WHERE country IN
(SELECT name
FROM leaders.states
WHERE indep_year < 1800);
- This is an example of a subquery, which is a query that sits inside another query.
- Does this include the presidents of Spain and Portugal?
- Since Spain does not have a president, it’s not included here and only the Portuguese president is listed.
- The
semi-join
chooses records in the first table where a condition IS met in the second table. - The
semi-join
matches records by key field in the right table with those in the left. - It then picks out only the rows in the left table that match the condition.
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE country IN \
(SELECT name \
FROM leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
president | country | continent | |
---|---|---|---|
0 | Marcelo Rebelo de Sousa | Portugal | Europe |
ANTI JOIN
- An
anti-join
chooses records in the first table where a condition IS NOT met in the second table. - Determine countries in the Americas founded after 1800.
- Use
NOT
to exclude those countries in the subquery.
1
2
3
4
5
6
7
SELECT president, country, continent
FROM leaders.presidents
WHERE continent LIKE '%America'
AND country NOT IN
(SELECT name
FROM leaders.states
WHERE indep_year < 1800);
- The
anti-join
picks out those columns in the left table that do not match the condition on the right table.
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT president, country, continent \
FROM leaders.presidents \
WHERE continent LIKE '%%America' \
AND country NOT IN \
(SELECT name \
FROM leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
president | country | continent | |
---|---|---|---|
0 | Jovenel Moise | Haiti | North America |
1 | Jose Mujica | Uruguay | South America |
2 | Michelle Bachelet | Chile | South America |
Exercises
Semi-JOIN
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 1/3
- Flash back to our Intro to SQL for Data Science course and begin by selecting all country codes in the Middle East as a single field result using
SELECT
,FROM
, andWHERE
.
1
2
3
4
5
6
-- Select code
SELECT code
-- From countries
FROM countries
-- Where region is Middle East
WHERE region = 'Middle East';
1
2
3
4
5
6
7
sql_stmt = "\
SELECT code \
FROM countries.countries \
WHERE region = 'Middle East'; \
"
pd.read_sql(sql_stmt, conn)
code | |
---|---|
0 | ARE |
1 | ARM |
2 | AZE |
3 | BHR |
4 | GEO |
5 | IRQ |
6 | ISR |
7 | YEM |
8 | JOR |
9 | KWT |
10 | CYP |
11 | LBN |
12 | OMN |
13 | QAT |
14 | SAU |
15 | SYR |
16 | TUR |
17 | PSE |
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 2/3
- Comment out the answer to the previous tab by surrounding it in
/*
and*/
. You’ll come back to it! - Below the commented code, select only unique languages by name appearing in the
languages
table. - Order the resulting single field table by
name
in ascending order.
1
2
3
4
5
6
-- Select field
SELECT DISTINCT name
-- From languages
FROM languages
-- Order by name
ORDER BY name;
1
2
3
4
5
6
7
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
name | |
---|---|
0 | Afar |
1 | Afrikaans |
2 | Akyem |
3 | Albanian |
4 | Alsatian |
... | ... |
391 | Yapese |
392 | Yoruba |
393 | Yue |
394 | Zezuru |
395 | Zulu |
396 rows × 1 columns
You are now going to use the concept of a semi-join to identify languages spoken in the Middle East.
Instructions 3/3
Now combine the previous two queries into one query:
- Add a
WHERE IN
statement to theELECT DISTINCT
query, and use the commented out query from the first instruction in there. That way, you can determine the unique languages spoken in the Middle East.
Carefully review this result and its code after completing it. It serves as a great example of subqueries, which are the focus of Chapter 4.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select distinct fields
SELECT DISTINCT name
-- From languages
FROM languages
-- Where in statement
WHERE code IN
-- Subquery
(SELECT code
FROM countries
WHERE region = 'Middle East')
-- Order by name
ORDER BY name;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT DISTINCT name \
FROM countries.languages \
WHERE code IN \
(SELECT code \
FROM countries.countries \
WHERE region = 'Middle East') \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
name | |
---|---|
0 | Arabic |
1 | Aramaic |
2 | Armenian |
3 | Azerbaijani |
4 | Azeri |
5 | Baluchi |
6 | Bulgarian |
7 | Circassian |
8 | English |
9 | Farsi |
10 | Filipino |
11 | French |
12 | Georgian |
13 | Greek |
14 | Hebrew |
15 | Hindi |
16 | Indian |
17 | Kurdish |
18 | Other |
19 | Persian |
20 | Romanian |
21 | Russian |
22 | Syriac |
23 | Turkish |
24 | Turkmen |
25 | unspecified |
26 | Urdu |
Relating semi-JOIN to a tweaked INNER JOIN
Let’s revisit the code from the previous exercise, which retrieves languages spoken in the Middle East.
1
2
3
4
5
6
7
SELECT DISTINCT name
FROM languages
WHERE code IN
(SELECT code
FROM countries
WHERE region = 'Middle East')
ORDER BY name;
Sometimes problems solved with semi-joins can also be solved using an inner join.
1
2
3
4
5
6
SELECT languages.name AS language
FROM languages
INNER JOIN countries
ON languages.code = countries.code
WHERE region = 'Middle East'
ORDER BY language;
This inner join isn’t quite right. What is missing from this second code block to get it to match with the correct answer produced by the first block?
Possible Answers
HAVING
instead ofWHERE
- **
DISTINCT
** UNIQUE
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT DISTINCT languages.name AS language \
FROM countries.languages \
INNER JOIN countries.countries \
ON languages.code = countries.code \
WHERE region = 'Middle East' \
ORDER BY language; \
"
pd.read_sql(sql_stmt, conn)
language | |
---|---|
0 | Arabic |
1 | Aramaic |
2 | Armenian |
3 | Azerbaijani |
4 | Azeri |
5 | Baluchi |
6 | Bulgarian |
7 | Circassian |
8 | English |
9 | Farsi |
10 | Filipino |
11 | French |
12 | Georgian |
13 | Greek |
14 | Hebrew |
15 | Hindi |
16 | Indian |
17 | Kurdish |
18 | Other |
19 | Persian |
20 | Romanian |
21 | Russian |
22 | Syriac |
23 | Turkish |
24 | Turkmen |
25 | unspecified |
26 | Urdu |
Diagnosing problems using anti-JOIN
Another powerful join in SQL is the anti-join. It is particularly useful in identifying which records are causing an incorrect number of records to appear in join queries.
You will also see another example of a subquery here, as you saw in the first exercise on semi-joins. Your goal is to identify the currencies used in Oceanian countries!
Instructions 1/3
- Begin by determining the number of countries in
countries
that are listed in Oceania usingSELECT
,FROM
, andWHERE
.
1
2
3
4
5
6
-- Select statement
SELECT count(name)
-- From countries
FROM countries
-- Where continent is Oceania
WHERE continent = 'Oceania';
1
2
3
4
5
6
7
sql_stmt = "\
SELECT count(name) \
FROM countries.countries \
WHERE continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
count | |
---|---|
0 | 19 |
Instructions 2/3
- Complete an inner join with
countries AS c1
on the left andcurrencies AS c2
on the right to get the different currencies used in the countries of Oceania. - Match
ON
thecode
field in the two tables. - Include the country
code
, countryname
, andbasic_unit AS currency
.
Observe query result and make note of how many different countries are listed here.
1
2
3
4
5
6
7
8
9
10
-- 5. Select fields (with aliases)
SELECT c1.code, c1.name, c2.basic_unit as currency
-- 1. From countries (alias as c1)
FROM countries as c1
-- 2. Join with currencies (alias as c2)
INNER JOIN currencies as c2
-- 3. Match on code
ON c1.code = c2.code
-- 4. Where continent is Oceania
WHERE continent = 'Oceania';
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT c1.code, c1.name, c2.basic_unit as currency \
FROM countries.countries as c1 \
INNER JOIN countries.currencies as c2 \
ON c1.code = c2.code \
WHERE continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
code | name | currency | |
---|---|---|---|
0 | AUS | Australia | Australian dollar |
1 | PYF | French Polynesia | CFP franc |
2 | KIR | Kiribati | Australian dollar |
3 | MHL | Marshall Islands | United States dollar |
4 | NRU | Nauru | Australian dollar |
5 | NCL | New Caledonia | CFP franc |
6 | NZL | New Zealand | New Zealand dollar |
7 | PLW | Palau | United States dollar |
8 | PNG | Papua New Guinea | Papua New Guinean kina |
9 | WSM | Samoa | Samoan tala |
10 | SLB | Solomon Islands | Solomon Islands dollar |
11 | TON | Tonga | Tongan paʻanga |
12 | TUV | Tuvalu | Australian dollar |
13 | TUV | Tuvalu | Tuvaluan dollar |
14 | VUT | Vanuatu | Vanuatu vatu |
Instructions 3/3
Note that not all countries in Oceania were listed in the resulting inner join with currencies. Use an anti-join to determine which countries were not included!
- Use
NOT IN
and(SELECT code FROM currencies)
as a subquery to get the country code and country name for the Oceanian countries that are not included in thecurrencies
table.
1
2
3
4
5
6
7
8
9
10
11
-- 3. Select fields
SELECT code, name
-- 4. From Countries
FROM countries
-- 5. Where continent is Oceania
WHERE continent = 'Oceania'
-- 1. And code not in
AND code NOT IN
-- 2. Subquery
(SELECT code
FROM currencies);
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT code, name \
FROM countries.countries \
WHERE continent = 'Oceania' \
AND code NOT IN \
(SELECT code \
FROM countries.currencies); \
"
pd.read_sql(sql_stmt, conn)
code | name | |
---|---|---|
0 | ASM | American Samoa |
1 | FJI | Fiji Islands |
2 | GUM | Guam |
3 | FSM | Micronesia, Federated States of |
4 | MNP | Northern Mariana Islands |
Set theory challenge
Congratulations! You’ve now made your way to the challenge problem for this third chapter. Your task here will be to incorporate two of UNION
/UNION ALL
/INTERSECT
/EXCEPT
to solve a challenge involving three tables.
In addition, you will use a subquery as you have in the last two exercises! This will be great practice as you hop into subqueries more in Chapter 4!
Instructions
- Identify the country codes that are included in either
economies
orcurrencies
but not inpopulations
. - Use that result to determine the names of cities in the countries that match the specification in the previous instruction.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Select the city name
SELECT country_code, name
-- Alias the table where city name resides
FROM cities AS c1
-- Choose only records matching the result of multiple set theory clauses
WHERE country_code IN
(
-- Select appropriate field from economies AS e
SELECT e.code
FROM economies AS e
-- Get all additional (unique) values of the field from currencies AS c2
UNION ALL
SELECT c2.code
FROM currencies AS c2
-- Exclude those appearing in populations AS p
EXCEPT
SELECT p.country_code
FROM populations AS p
);
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
sql_stmt = "\
SELECT country_code, name \
FROM countries.cities AS c1 \
WHERE country_code IN \
( SELECT e.code \
FROM countries.economies AS e \
UNION ALL \
SELECT c2.code \
FROM countries.currencies AS c2 \
EXCEPT \
SELECT p.country_code \
FROM countries.populations AS p); \
"
pd.read_sql(sql_stmt, conn)
country_code | name | |
---|---|---|
0 | ROM | Bucharest |
1 | TWN | Kaohsiung |
2 | TWN | New Taipei City |
3 | TWN | Taichung |
4 | TWN | Tainan |
5 | TWN | Taipei |
Subqueries
In this closing chapter, you’ll learn how to use nested queries to add some finesse to your data insights. You’ll also wrap all of the content covered throughout this course into solving three challenge problems.
Subqueries inside WHERE and SELECT clauses
- The most common type of subquery is one inside of a
WHERE
statement. - Examples include semi-join and anti-join
leaders.states
1
2
SELECT name, indep_year, fert_rate, women_parli_perc
FROM leaders.states;
Average fert_rate
1
2
SELECT AVG(fert_rate)
FROM leaders.states;
Asian countries below average fert_rate
1
2
3
4
5
6
SELECT name, fert_rate
FROM leaders.states
WHERE continent = 'Asia'
AND fert_rate <
(SELECT AVG(fert_rate)
FROM leaders.states);
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name, fert_rate \
FROM leaders.states \
WHERE continent = 'Asia' \
AND fert_rate < \
(SELECT AVG(fert_rate) \
FROM leaders.states); \
"
pd.read_sql(sql_stmt, conn)
name | fert_rate | |
---|---|---|
0 | Brunei | 1.96 |
1 | Vietnam | 1.70 |
- The second most common type of subquery is inside of a
SELECT
clause.
Count the number of countries listed in states table for each continent in the prime_ministers
table.
Continents in the prime_ministers
table
1
2
SELECT DISTINCT continent
FROM prime_ministers;
Determine the counts of the number of countries in states for each of the continents in the last slide
1
2
3
4
5
6
SELECT count(name)
FROM leaders.states
WHERE continent IN
(SELECT DISTINCT continent
FROM leaders.prime_ministers);
--Returns total count of countries
1
2
3
4
5
SELECT DISTINCT continent,
(SELECT COUNT(*)
FROM leaders.states
WHERE prime_ministers.continent = states.continent) as countries_num
FROM leaders.prime_ministers
- The subquery involving states, can also reference the
prime_ministers
table in the main query. - Anytime you do a subquery inside a
SELECT
statement, you need to give the subquery an alias (e.g.countries_num
in the example) - There are numerous ways to solve problems with SQL queries.
- A carefully constructed JOIN could achieve this same result.
1
2
3
4
5
6
7
8
9
sql_stmt = "\
SELECT DISTINCT continent, \
(SELECT COUNT(*) \
FROM leaders.states \
WHERE prime_ministers.continent = states.continent) as countries_num \
FROM leaders.prime_ministers \
"
pd.read_sql(sql_stmt, conn)
continent | countries_num | |
---|---|---|
0 | Africa | 2 |
1 | Asia | 4 |
2 | Europe | 3 |
3 | North America | 1 |
4 | Oceania | 1 |
Exercises
Subquery inside WHERE
You’ll now try to figure out which countries had high average life expectancies (at the country level) in 2015.
Instructions 1/2
- Begin by calculating the average life expectancy across all countries for 2015.
1
2
3
4
5
6
-- Select average life_expectancy
SELECT avg(life_expectancy)
-- From populations
FROM populations
-- Where year is 2015
WHERE year = 2015;
1
2
3
4
5
6
7
sql_stmt = "\
SELECT avg(life_expectancy) \
FROM countries.populations \
WHERE year = 2015; \
"
pd.read_sql(sql_stmt, conn)
avg | |
---|---|
0 | 71.676342 |
Instructions 2/2
- Recall that you can use SQL to do calculations for you. Suppose we wanted only records that were above
1.15 * 100
in terms of life expectancy for 2015:
1
2
3
4
SELECT *
FROM populations
WHERE life_expectancy > 1.15 * 100
AND year = 2015;
- Select all fields from
populations
with records corresponding to larger than 1.15 times the average you calculated in the first task for 2015. In other words, change the100
in the example above with a subquery.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields
SELECT *
-- From populations
FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy >
-- 1.15 * subquery
1.15 *
(SELECT avg(life_expectancy)
FROM populations
WHERE year = 2015)
AND year = 2015;
1
2
3
4
5
6
7
8
9
10
11
12
sql_stmt = "\
SELECT * \
FROM countries.populations \
WHERE life_expectancy > \
1.15 * \
(SELECT avg(life_expectancy) \
FROM countries.populations \
WHERE year = 2015) \
AND year = 2015; \
"
pd.read_sql(sql_stmt, conn)
pop_id | country_code | year | fertility_rate | life_expectancy | size | |
---|---|---|---|---|---|---|
0 | 21 | AUS | 2015 | 1.833 | 82.451220 | 23789752.0 |
1 | 376 | CHE | 2015 | 1.540 | 83.197560 | 8281430.0 |
2 | 356 | ESP | 2015 | 1.320 | 83.380486 | 46443992.0 |
3 | 134 | FRA | 2015 | 2.010 | 82.670730 | 66538392.0 |
4 | 170 | HKG | 2015 | 1.195 | 84.278046 | 7305700.0 |
5 | 174 | ISL | 2015 | 1.930 | 82.860980 | 330815.0 |
6 | 190 | ITA | 2015 | 1.370 | 83.490240 | 60730584.0 |
7 | 194 | JPN | 2015 | 1.460 | 83.843660 | 126958470.0 |
8 | 340 | SGP | 2015 | 1.240 | 82.595120 | 5535002.0 |
9 | 374 | SWE | 2015 | 1.880 | 82.551216 | 9799186.0 |
Subquery inside WHERE (2)
Use your knowledge of subqueries in WHERE
to get the urban area population for only capital cities.
Instructions
- Make use of the
capital
field in thecountries
table in your subquery. - Select the city name, country code, and urban area population fields.
1
2
3
4
5
6
7
8
9
10
-- 2. Select fields
SELECT name, country_code, urbanarea_pop
-- 3. From cities
FROM cities
-- 4. Where city name in the field of capital cities
WHERE name IN
-- 1. Subquery
(SELECT capital
FROM countries)
ORDER BY urbanarea_pop DESC;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT name, country_code, urbanarea_pop \
FROM countries.cities \
WHERE name IN \
(SELECT capital \
FROM countries.countries) \
ORDER BY urbanarea_pop DESC; \
"
pd.read_sql(sql_stmt, conn)
name | country_code | urbanarea_pop | |
---|---|---|---|
0 | Beijing | CHN | 21516000.0 |
1 | Dhaka | BGD | 14543124.0 |
2 | Tokyo | JPN | 13513734.0 |
3 | Moscow | RUS | 12197596.0 |
4 | Cairo | EGY | 10230350.0 |
5 | Kinshasa | COD | 10130000.0 |
6 | Jakarta | IDN | 10075310.0 |
7 | Seoul | KOR | 9995784.0 |
8 | Mexico City | MEX | 8974724.0 |
9 | Lima | PER | 8852000.0 |
10 | London | GBR | 8673713.0 |
11 | Bangkok | THA | 8280925.0 |
12 | Tehran | IRN | 8154051.0 |
13 | Bogota | COL | 7878783.0 |
14 | Baghdad | IRQ | 7180889.0 |
15 | Hanoi | VNM | 6844100.0 |
16 | Santiago | CHL | 5743719.0 |
17 | Riyadh | SAU | 5676621.0 |
18 | Singapore | SGP | 5535000.0 |
19 | Ankara | TUR | 5271000.0 |
20 | Khartoum | SDN | 3639598.0 |
21 | Berlin | DEU | 3517424.0 |
22 | Algiers | DZA | 3415811.0 |
23 | Kabul | AFG | 3414100.0 |
24 | Pyongyang | PRK | 3255388.0 |
25 | Madrid | ESP | 3207247.0 |
26 | Baku | AZE | 3202300.0 |
27 | Nairobi | KEN | 3138369.0 |
28 | Addis Ababa | ETH | 3103673.0 |
29 | Buenos Aires | ARG | 3054300.0 |
30 | Kiev | UKR | 2908703.0 |
31 | Rome | ITA | 2877215.0 |
32 | Luanda | AGO | 2825311.0 |
33 | Quito | ECU | 2671191.0 |
34 | Managua | NIC | 2560789.0 |
35 | Brasilia | BRA | 2556149.0 |
36 | Yaounde | CMR | 2440462.0 |
37 | Tashkent | UZB | 2309600.0 |
38 | Phnom Penh | KHM | 2234566.0 |
39 | Paris | FRA | 2229621.0 |
40 | Ouagadougou | BFA | 2200000.0 |
41 | Guatemala City | GTM | 2110100.0 |
42 | Havana | CUB | 2106146.0 |
43 | Accra | GHA | 2070463.0 |
44 | Minsk | BLR | 1959781.0 |
45 | Caracas | VEN | 1943901.0 |
46 | Sana'a | YEM | 1937451.0 |
47 | Islamabad | PAK | 1900000.0 |
48 | Vienna | AUT | 1863881.0 |
49 | Brazzaville | COG | 1827000.0 |
50 | Manila | PHL | 1780148.0 |
51 | Kuala Lumpur | MYS | 1768000.0 |
52 | Maputo | MOZ | 1766184.0 |
53 | Budapest | HUN | 1759407.0 |
54 | Warsaw | POL | 1753977.0 |
55 | Lusaka | ZMB | 1742979.0 |
56 | Harare | ZWE | 1606000.0 |
57 | Kampala | UGA | 1507080.0 |
58 | Prague | CZE | 1324000.0 |
59 | Montevideo | URY | 1305082.0 |
60 | Abuja | NGA | 1235880.0 |
61 | Dakar | SEN | 1146053.0 |
62 | Abu Dhabi | ARE | 1145000.0 |
63 | Tripoli | LBY | 1126000.0 |
64 | Yerevan | ARM | 1060138.0 |
65 | Tunis | TUN | 1056247.0 |
Subquery inside SELECT
In this exercise, you’ll see how some queries can be written using either a join or a subquery.
You have seen previously how to use GROUP BY
with aggregate functions and an inner join to get summarized information from multiple tables.
The code given in query.sql selects the top nine countries in terms of number of cities appearing in the cities
table. Recall that this corresponds to the most populous cities in the world. Your task will be to convert the commented out code to get the same result as the code shown.
Instructions 1/2
- Just Submit Answer here!
1
2
3
4
5
6
7
SELECT countries.name AS country, COUNT(*) AS cities_num
FROM cities
INNER JOIN countries
ON countries.code = cities.country_code
GROUP BY country
ORDER BY cities_num DESC, country
LIMIT 9;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT countries.name AS country, COUNT(*) AS cities_num \
FROM countries.cities \
INNER JOIN countries.countries \
ON countries.code = cities.country_code \
GROUP BY country \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
country | cities_num | |
---|---|---|
0 | China | 36 |
1 | India | 18 |
2 | Japan | 11 |
3 | Brazil | 10 |
4 | Pakistan | 9 |
5 | United States | 9 |
6 | Indonesia | 7 |
7 | Russian Federation | 7 |
8 | South Korea | 7 |
Instructions 2/2
- Remove the comments around the second query and comment out the first query instead.
- Convert the
GROUP BY
code to use a subquery inside ofSELECT
, i.e. fill in the blanks to get a result that matches the one given using theGROUP BY
code in the first query. - Again, sort the result by
cities_num
descending and then bycountry
ascending.
1
2
3
4
5
6
7
SELECT countries.name AS country,
(SELECT count(*)
FROM cities
WHERE countries.code = cities.country_code) AS cities_num
FROM countries
ORDER BY cities_num DESC, country
LIMIT 9;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT countries.name AS country, \
(SELECT count(*) \
FROM countries.cities \
WHERE countries.code = cities.country_code) AS cities_num \
FROM countries.countries \
ORDER BY cities_num DESC, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
country | cities_num | |
---|---|---|
0 | China | 36 |
1 | India | 18 |
2 | Japan | 11 |
3 | Brazil | 10 |
4 | Pakistan | 9 |
5 | United States | 9 |
6 | Indonesia | 7 |
7 | Russian Federation | 7 |
8 | South Korea | 7 |
Subquery inside FROM clause
- The last basic type of subquery exists inside of a
FROM
clause. Determine the maximum percentage of women in parliament for each continent listing inleaders.states
1
2
3
4
SELECT continent, MAX(women_parli_perc) AS max_perc
FROM states
GROUP BY continent
ORDER BY continent;
- This query will only work if
continent
is included as one of th fields in theSELECT
clause, since we are grouping based on that field.
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT continent, MAX(women_parli_perc) AS max_perc \
FROM leaders.states \
GROUP BY continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
continent | max_perc | |
---|---|---|
0 | Africa | 14.90 |
1 | Asia | 24.00 |
2 | Europe | 39.60 |
3 | North America | 2.74 |
4 | Oceania | 32.74 |
5 | South America | 22.31 |
Focusing on records in monarchs
- Multiple tables can be included in the
FROM
clause, by adding a comma between them
1
2
3
4
SELECT monarchs.continent
FROM monarchs, states
WHERE monarchs.continent = states.continent
ORDER BY continent;
- Produces part of the answer; how should duplicates be removed?
1
2
3
4
5
6
7
8
sql_stmt = "\
SELECT monarchs.continent \
FROM leaders.monarchs, leaders.states \
WHERE monarchs.continent = states.continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
continent | |
---|---|
0 | Asia |
1 | Asia |
2 | Asia |
3 | Asia |
4 | Asia |
5 | Asia |
6 | Asia |
7 | Asia |
8 | Europe |
9 | Europe |
10 | Europe |
11 | Europe |
12 | Europe |
13 | Europe |
Finishing the subquery
- To get Asia and Europe to appear only once, use
DISTINCT
in theSELECT
statement.
1
2
3
4
5
6
7
SELECT DISTINCT monarchs.continent, subquery.max_perc
FROM monarchs,
(SELECT continent, MAX(women_parli_perc) AS max_perc
FROM states
GROUP BY continent) AS subquery
WHERE monarchs.continent = subquery.continent
ORDER BY continent;
- How is the
max_perc
column included with continent? - Instead of including states in the
FROM
clause, include the subquery instead and alias it with a name likesubquery
. - This is how to include a subquery as a temporary table in the
FROM
clause.
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT DISTINCT monarchs.continent, subquery.max_perc \
FROM leaders.monarchs, \
(SELECT continent, MAX(women_parli_perc) AS max_perc \
FROM leaders.states \
GROUP BY continent) AS subquery \
WHERE monarchs.continent = subquery.continent \
ORDER BY continent; \
"
pd.read_sql(sql_stmt, conn)
continent | max_perc | |
---|---|---|
0 | Asia | 24.0 |
1 | Europe | 39.6 |
Exercises
Subquery inside FROM
The last type of subquery you will work with is one inside of FROM
.
You will use this to determine the number of languages spoken for each country, identified by the country’s local name! (Note this may be different than the name
field and is stored in the local_name
field.)
Instructions 1/2
Begin by determining for each country code how many languages are listed in the languages
table using SELECT
, FROM
, and GROUP BY
. Alias the aggregated field as lang_num
.
1
2
3
4
5
6
-- Select fields (with aliases)
SELECT code, count(name) AS lang_num
-- From languages
FROM languages
-- Group by code
GROUP BY code;
1
2
3
4
5
6
7
8
9
10
sql_stmt = "\
SELECT code, count(name) AS lang_num \
FROM countries.languages \
GROUP BY code \
ORDER BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
1
2
3
4
5
6
7
8
9
10
11
12
code lang_num
0 ZMB 19
1 ZWE 16
2 ETH 16
3 IND 14
4 NPL 14
code lang_num
207 COL 1
208 AIA 1
209 DOM 1
210 SAU 1
211 PRK 1
Instructions 2/2
- Include the previous query (aliased as
subquery
) as a subquery in theFROM
clause of a new query. - Select the local name of the country from
countries
. - Also, select
lang_num
fromsubquery
. - Make sure to use
WHERE
appropriately to matchcode
incountries
and insubquery
. - Sort by
lang_num
in descending order.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields
SELECT local_name, subquery.lang_num
-- From countries
FROM countries,
-- Subquery (alias as subquery)
(SELECT code, count(name) as lang_num
FROM languages
GROUP BY code) AS subquery
-- Where codes match
WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
1
2
3
4
5
6
7
8
9
10
11
12
13
sql_stmt = "\
SELECT local_name, subquery.lang_num \
FROM countries.countries, \
(SELECT code, count(name) as lang_num \
FROM countries.languages \
GROUP BY code) AS subquery \
WHERE countries.code = subquery.code \
ORDER BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
1
2
3
4
5
6
7
8
9
10
11
12
local_name lang_num
0 Zambia 19
1 YeItyop´iya 16
2 Zimbabwe 16
3 Bharat/India 14
4 Nepal 14
local_name lang_num
194 Republica Dominicana 1
195 The Turks and Caicos Islands 1
196 Nederland 1
197 United Kingdom 1
198 Brasil 1
Advanced subquery
You can also nest multiple subqueries to answer even more specific questions.
In this exercise, for each of the six continents listed in 2015, you’ll identify which country had the maximum inflation rate (and how high it was) using multiple subqueries. The table result of your query in Task 3 should look something like the following, where anything between <
>
will be filled in with appropriate values:
1
2
3
4
5
6
7
8
9
10
+------------+---------------+-------------------+
| name | continent | inflation_rate |
|------------+---------------+-------------------|
| <country1> | North America | <max_inflation1> |
| <country2> | Africa | <max_inflation2> |
| <country3> | Oceania | <max_inflation3> |
| <country4> | Europe | <max_inflation4> |
| <country5> | South America | <max_inflation5> |
| <country6> | Asia | <max_inflation6> |
+------------+---------------+-------------------+
Again, there are multiple ways to get to this solution using only joins, but the focus here is on showing you an introduction into advanced subqueries.
Instructions 1/3
- Create an inner join with
countries
on the left andeconomies
on the right withUSING
. Do not alias your tables or columns. - Retrieve the country name, continent, and inflation rate for 2015.
1
2
3
4
5
6
7
8
9
10
-- Select fields
SELECT name, continent, inflation_rate
-- From countries
FROM countries
-- Join to economies
INNER JOIN economies
-- Match on code
USING (code)
-- Where year is 2015
WHERE year = 2015;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
USING (code) \
WHERE year = 2015; \
"
inf_rate = pd.read_sql(sql_stmt, conn)
print(inf_rate.head())
print(inf_rate.tail())
1
2
3
4
5
6
7
8
9
10
11
12
name continent inflation_rate
0 Afghanistan Asia -1.549
1 Angola Africa 10.287
2 Albania Europe 1.896
3 United Arab Emirates Asia 4.070
4 Argentina South America NaN
name continent inflation_rate
180 Samoa Oceania 1.923
181 Yemen Asia 39.403
182 South Africa Africa 4.575
183 Zambia Africa 10.107
184 Zimbabwe Africa -2.410
Instructions 2/3
- Determine the maximum inflation rate for each continent in 2015 using the previous query as a subquery called
subquery
in theFROM
clause. - Select the maximum inflation rate
AS max_inf
grouped by continent. - This will result in the six maximum inflation rates in 2015 for the six continents as one field table. (Don’t include
continent
in the outerSELECT
statement.)
1
2
3
4
5
6
7
8
9
10
11
-- Select fields
SELECT max(inflation_rate) as max_inf
-- Subquery using FROM (alias as subquery)
FROM (
SELECT name, continent, inflation_rate
FROM countries
INNER JOIN economies
USING (code)
WHERE year = 2015) AS subquery
-- Group by continent
GROUP BY continent;
1
2
3
4
5
6
7
8
9
10
11
12
sql_stmt = "\
SELECT max(inflation_rate) as max_inf \
FROM ( \
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
USING (code) \
WHERE year = 2015) AS subquery \
GROUP BY continent; \
"
pd.read_sql(sql_stmt, conn)
max_inf | |
---|---|
0 | 21.858 |
1 | 39.403 |
2 | 121.738 |
3 | 7.524 |
4 | 48.684 |
5 | 9.784 |
Instructions 3/3
- Append the second part’s query to the first part’s query using
WHERE
,AND
, andIN
to obtain the name of the country, its continent, and the maximum inflation rate for each continent in 2015. Revisit the sample output in the assignment text at the beginning of the exercise to see how this matches up. - For the sake of practice, change all joining conditions to use
ON
instead ofUSING
. - This code works since each of the six maximum inflation rate values occur only once in the 2015 data. Think about whether this particular code involving subqueries would work in cases where there are ties for the maximum inflation rate values.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Select fields
SELECT name, continent, inflation_rate
-- From countries
FROM countries
-- Join to economies
INNER JOIN economies
-- Match on code
ON countries.code = economies.code
-- Where year is 2015
WHERE year = 2015
-- And inflation rate in subquery (alias as subquery)
AND inflation_rate IN (
SELECT max(inflation_rate) as max_inf
FROM (
SELECT name, continent, inflation_rate
FROM countries
INNER JOIN economies
ON countries.code = economies.code
WHERE year = 2015) AS subquery
GROUP BY continent);
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
sql_stmt = "\
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
ON countries.code = economies.code \
WHERE year = 2015 \
AND inflation_rate IN ( \
SELECT max(inflation_rate) as max_inf \
FROM ( \
SELECT name, continent, inflation_rate \
FROM countries.countries \
INNER JOIN countries.economies \
ON countries.code = economies.code \
WHERE year = 2015) AS subquery \
GROUP BY continent); \
"
pd.read_sql(sql_stmt, conn)
name | continent | inflation_rate | |
---|---|---|---|
0 | Haiti | North America | 7.524 |
1 | Malawi | Africa | 21.858 |
2 | Nauru | Oceania | 9.784 |
3 | Ukraine | Europe | 48.684 |
4 | Venezuela | South America | 121.738 |
5 | Yemen | Asia | 39.403 |
Subquery challenge
Let’s test your understanding of the subqueries with a challenge problem! Use a subquery to get 2015 economic data for countries that do not have
gov_form
of'Constitutional Monarchy'
or'Republic'
in theirgov_form
.
Here, gov_form
stands for the form of the government for each country. Review the different entries for gov_form
in the countries
table.
Instructions
- Select the country code, inflation rate, and unemployment rate.
- Order by inflation rate ascending.
- Do not use table aliasing in this exercise.
1
2
3
4
5
6
7
8
9
10
11
12
-- Select fields
SELECT code, inflation_rate, unemployment_rate
-- From economies
FROM economies
-- Where year is 2015 and code is not in
WHERE year = 2015 AND code NOT IN
-- Subquery
(SELECT code
FROM countries
WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%Republic'))
-- Order by inflation rate
ORDER BY inflation_rate;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT code, inflation_rate, unemployment_rate \
FROM countries.economies \
WHERE year = 2015 AND code NOT IN \
(SELECT code \
FROM countries.countries \
WHERE (gov_form = 'Constitutional Monarchy' OR gov_form LIKE '%%Republic')) \
ORDER BY inflation_rate; \
"
pd.read_sql(sql_stmt, conn)
code | inflation_rate | unemployment_rate | |
---|---|---|---|
0 | AFG | -1.549 | NaN |
1 | CHE | -1.140 | 3.178 |
2 | PRI | -0.751 | 12.000 |
3 | ROU | -0.596 | 6.812 |
4 | BRN | -0.423 | 6.900 |
5 | TON | -0.283 | NaN |
6 | OMN | 0.065 | NaN |
7 | TLS | 0.553 | NaN |
8 | BEL | 0.620 | 8.492 |
9 | CAN | 1.132 | 6.900 |
10 | MNE | 1.204 | NaN |
11 | SRB | 1.392 | 18.200 |
12 | AUS | 1.461 | 6.058 |
13 | QAT | 1.814 | NaN |
14 | BHR | 1.836 | NaN |
15 | WSM | 1.923 | NaN |
16 | MYS | 2.104 | 3.100 |
17 | SAU | 2.189 | 5.591 |
18 | HKG | 3.037 | 3.296 |
19 | KWT | 3.233 | 2.072 |
20 | ARE | 4.070 | NaN |
21 | MAC | 4.564 | 1.825 |
22 | SWZ | 4.960 | NaN |
23 | BTN | 6.336 | 3.200 |
24 | LBY | 9.839 | NaN |
25 | SSD | 52.813 | NaN |
Subquery review
Within which SQL clause are subqueries most frequently found?
Answer the question
- **WHERE**
FROMSELECTIN
Course Review
- In SQL, a join combines columns from one or more tables in a relational database via a lookup process.
- There are four types of joins covered in this course
Types of joins:
- INNER JOIN: also denoted as JOIN
- Self-joins: special case
- OUTER JOIN
- LEFT JOIN: also denoted as LEFT OUTER JOIN
- RIGHT JOIN: also denoted as RIGHT OUTER JOIN
- FULL JOIN: also denoted as FULL OUTER JOIN
- CROSS JOIN: create all possible combinations between two tables
- Semi-join / Anti-join
Notes
- Words appearing in ALL capital letters correspond to joins having simple SQL syntax.
- Self-joins, semi-joins, and anti-joins don’t have built-in SQL syntax.
- An INNER JOIN keeps only the records in which the key field (or fields) is in both tables.
- A LEFT JOIN keeps all the records in fields specified in the left table and includes the matches in the right table based on the key field or fields.
- Key field values that don’t match in the right table are included as missing data in the resulting table of a LEFT JOIN.
- A RIGHT JOIN keeps all the records specified in the right table and includes the matches from the key field(s) in the left table.
- THose that don’t match are included as missing values in the resulting table from the RIGHT JOIN query.
- A FULL JOIN is a combination of a LEFT JOIN and a RIGHT JOIN showing exactly which values appear in both tables and those that appear in only one or the other table.
- A CROSS JOIN matches all records from fields specified in one table with all records from fields specified in another table.
- Remember that a
CROSS JOIN
does not have anON
orUSING
clause, but otherwise looks very similar to the code for anINNER JOIN
,LEFT JOIN
,RIGHT JOIN
, orFULL JOIN
.
Set Theory Clauses
- Recall that
UNION
includes every record in both tables but DOES NOT double count those that are in both tables. UNION ALL
does replicate those that are in both tables.INTERSECT
gives only those records found in both of the two tables.EXCEPT
gives only those records in one table but not the other.
Semi-joins and Anti-joins
- When you’d like to filter your first table based on conditions set on a second table, you should use a semi-join to accomplish the task.
- If instead you’d like to filter the first table based on conditions NOT being met on a second table, you should use an anti-join.
- Anti-joins are particularly useful in diagnosing problems with other joins in terms of getting fewer or more records than expected.
Types of basic subqueries
- The most common type of subquery is done inside of a
WHERE
clause. - The next most frequent types of subqueries are inside
SELECT
clauses and insideFROM
clauses. - Subqueries can also find their way into the ON statement of a join in ways similar to what you’ve seen inside
WHERE
clauses too.
Final Challenge
Welcome to the end of the course! The next three exercises will test your knowledge of the content covered in this course and apply many of the ideas you’ve seen to difficult problems. Good luck!
Read carefully over the instructions and solve them step-by-step, thinking about how the different clauses work together.
In this exercise, you’ll need to get the country names and other 2015 data in the economies
table and the countries
table for Central American countries with an official language.
Instructions
- Select unique country names. Also select the total investment and imports fields.
- Use a left join with
countries
on the left. (An inner join would also work, but please use a left join here.) - Match on
code
in the two tablesAND
use a subquery inside ofON
to choose the appropriatelanguages
records. - Order by country name ascending.
- Use table aliasing but not field aliasing in this exercise.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Select fields
SELECT DISTINCT c.name, e.total_investment, e.imports
-- From table (with alias)
FROM countries AS c
-- Join with table (with alias)
LEFT JOIN economies AS e
-- Match on code
ON (c.code = e.code
-- and code in Subquery
AND c.code IN (
SELECT l.code
FROM languages AS l
WHERE official = 'true'
) )
-- Where region and year are correct
WHERE year = 2015 AND region = 'Central America'
-- Order by field
ORDER BY name;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
sql_stmt = "\
SELECT DISTINCT c.name, e.total_investment, e.imports \
FROM countries.countries AS c \
LEFT JOIN countries.economies AS e \
ON (c.code = e.code \
AND c.code IN ( \
SELECT l.code \
FROM countries.languages AS l \
WHERE official = 'true' \
) ) \
WHERE year = 2015 AND region = 'Central America' \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
name | total_investment | imports | |
---|---|---|---|
0 | Belize | 22.014 | 6.743 |
1 | Costa Rica | 20.218 | 4.629 |
2 | El Salvador | 13.983 | 8.193 |
3 | Guatemala | 13.433 | 15.124 |
4 | Honduras | 24.633 | 9.353 |
5 | Nicaragua | 31.862 | 11.665 |
6 | Panama | 46.557 | 5.898 |
Final Challenge (2)
Whoofta! That was challenging, huh?
Let’s ease up a bit and calculate the average fertility rate for each region in 2015.
Instructions
- Include the name of region, its continent, and average fertility rate aliased as
avg_fert_rate
. - Sort based on
avg_fert_rate
ascending. - Remember that you’ll need to
GROUP BY
all fields that aren’t included in the aggregate function ofSELECT
.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Select fields
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate
-- From left table
FROM populations AS p
-- Join to right table
INNER JOIN countries AS c
-- Match on join condition
ON p.country_code = c.code
-- Where specific records matching some condition
WHERE year = 2015
-- Group appropriately
GROUP BY c.continent, c.region
-- Order appropriately
ORDER BY avg_fert_rate;
1
2
3
4
5
6
7
8
9
10
11
sql_stmt = "\
SELECT c.region, c.continent, AVG(p.fertility_rate) AS avg_fert_rate \
FROM countries.populations AS p \
INNER JOIN countries.countries AS c \
ON p.country_code = c.code \
WHERE year = 2015 \
GROUP BY c.continent, c.region \
ORDER BY avg_fert_rate; \
"
pd.read_sql(sql_stmt, conn)
region | continent | avg_fert_rate | |
---|---|---|---|
0 | Southern Europe | Europe | 1.426100 |
1 | Eastern Europe | Europe | 1.490889 |
2 | Baltic Countries | Europe | 1.603333 |
3 | Eastern Asia | Asia | 1.620714 |
4 | Western Europe | Europe | 1.632500 |
5 | North America | North America | 1.765750 |
6 | British Islands | Europe | 1.875000 |
7 | Nordic Countries | Europe | 1.893333 |
8 | Australia and New Zealand | Oceania | 1.911500 |
9 | Caribbean | North America | 1.950571 |
10 | Southeast Asia | Asia | 2.156000 |
11 | South America | South America | 2.274750 |
12 | Central America | North America | 2.326375 |
13 | Middle East | Asia | 2.547056 |
14 | Southern and Central Asia | Asia | 2.634143 |
15 | Micronesia | Oceania | 2.864750 |
16 | Northern Africa | Africa | 2.908167 |
17 | Southern Africa | Africa | 2.990800 |
18 | Melanesia | Oceania | 3.135800 |
19 | Polynesia | Oceania | 3.244333 |
20 | Eastern Africa | Africa | 4.386706 |
21 | Western Africa | Africa | 4.960125 |
22 | Central Africa | Africa | 4.967889 |
Final Challenge (3)
Welcome to the last challenge problem. By now you’re a query warrior! Remember that these challenges are designed to take you to the limit to solidify your SQL knowledge! Take a deep breath and solve this step-by-step.
You are now tasked with determining the top 10 capital cities in Europe and the Americas in terms of a calculated percentage using city_proper_pop
and metroarea_pop
in cities
.
Do not use table aliasing in this exercise.
Instructions
- Select the city name, country code, city proper population, and metro area population.
- Calculate the percentage of metro area population composed of city proper population for each city in
cities
, aliased ascity_perc
. - Focus only on capital cities in Europe and the Americas in a subquery.
- Make sure to exclude records with missing data on metro area population.
- Order the result by
city_perc
descending. - Then determine the top 10 capital cities in Europe and the Americas in terms of this
city_perc
percentage.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Select fields
SELECT name, country_code, city_proper_pop, metroarea_pop,
-- Calculate city_perc
city_proper_pop / metroarea_pop * 100 AS city_perc
-- From appropriate table
FROM cities
-- Where
WHERE name IN
-- Subquery
(SELECT capital
FROM countries
WHERE (continent = 'Europe'
OR continent LIKE '%America'))
AND metroarea_pop IS NOT NULL
-- Order appropriately
ORDER BY city_perc desc
-- Limit amount
LIMIT 10;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
sql_stmt = "\
SELECT name, country_code, city_proper_pop, metroarea_pop, \
city_proper_pop / metroarea_pop * 100 AS city_perc \
FROM countries.cities \
WHERE name IN \
(SELECT capital \
FROM countries.countries \
WHERE (continent = 'Europe' \
OR continent LIKE '%%America')) \
AND metroarea_pop IS NOT NULL \
ORDER BY city_perc desc \
LIMIT 10; \
"
pd.read_sql(sql_stmt, conn)
name | country_code | city_proper_pop | metroarea_pop | city_perc | |
---|---|---|---|---|---|
0 | Lima | PER | 8852000.0 | 10750000.0 | 82.344186 |
1 | Bogota | COL | 7878783.0 | 9800000.0 | 80.395746 |
2 | Moscow | RUS | 12197596.0 | 16170000.0 | 75.433493 |
3 | Vienna | AUT | 1863881.0 | 2600000.0 | 71.687728 |
4 | Montevideo | URY | 1305082.0 | 1947604.0 | 67.009616 |
5 | Caracas | VEN | 1943901.0 | 2923959.0 | 66.481817 |
6 | Rome | ITA | 2877215.0 | 4353775.0 | 66.085523 |
7 | Brasilia | BRA | 2556149.0 | 3919864.0 | 65.210146 |
8 | London | GBR | 8673713.0 | 13879757.0 | 62.491822 |
9 | Budapest | HUN | 1759407.0 | 2927944.0 | 60.090184 |
1
conn.close()