Course Description
You’ve already used SQL to query data from databases. But did you know that there's a lot more you can do with databases? You can model different phenomena in your data, as well as the relationships between them. This gives your data structure and consistency, which results in better data quality. In this course, you'll experience this firsthand by working with a real-life dataset that was used to investigate questionable university affiliations. Column by column, table by table, you'll get to unlock and admire the full potential of databases. You'll learn how to create tables and specify their relationships, as well as how to enforce data integrity. You'll also discover other unique features of database systems, such as constraints.
Datasets
Imports
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table
from sqlalchemy import Column
from sqlalchemy import Integer, String
from sqlalchemy import inspect
from sqlalchemy.types import Text
import pandas as pd
from pprint import pprint as pp
from itertools import combinations
Pandas Configuration Options
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
PostgreSQL Connection
engine
to use your database username and password.engine = create_engine('postgresql://postgres:postgres@localhost/postgres')
conn = engine.connect()
university_professors
table¶Create an empty table with conn
conn.execute('CREATE SCHEMA uni;')
conn.execute(' \
CREATE TABLE uni.university_professors ( \
"firstname" text, \
"lastname" text, \
"university" text, \
"university_shortname" text, \
"university_city" text, \
"function" text, \
"organization" text, \
"organization_sector" text \
);')
Create the filled table from pandas
uni_prof = pd.read_excel('data/2019-12-13_intro_to_relational_databases/university_professors.xlsx')
uni_prof.head()
_dtypes = {'firstname': Text(), 'lastname': Text(), 'university': Text(), 'university_shortname': Text(),
'university_city': Text(), 'function': Text(), 'organization': Text(), 'organization_sector': Text()}
uni_prof.to_sql('university_professors', con=engine, schema='uni', index=False, if_exists='replace', dtype=_dtypes)
In this chapter, you'll create your very first database with a set of simple SQL commands. Next, you'll migrate data from existing flat tables into that database. You'll also learn how meta-information about a database can be queried.
In the video, we talked about some basic facts about relational databases. Which of the following statements does not hold true for databases? Relational databases ...
Possible Answers
SELECT
¶information_schema
is a meta-database that holds information about your current database. information_schema
has multiple tables you can query with the known SELECT * FROM
syntax:
tables
: information about all tables in your current databasecolumns
: information about all columns in all of the tables in your current databaseIn this exercise, you'll only need information from the 'public'
schema, which is specified as the column table_schema
of the tables
and columns
tables. The 'public'
schema holds information about user-defined tables and databases. The other types of table_schema
hold system information - for this course, you're only interested in user-defined stuff.
Instructions 1/4
'public'
table_schema
.-- Query the right table in information_schema
SELECT table_name
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
stmt = "\
SELECT table_name \
FROM information_schema.tables \
WHERE table_schema = 'uni'; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 2/4
university_professors
by selecting all entries in information_schema.columns
that correspond to that table.-- Query the right table in information_schema to get columns
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'university_professors' AND table_schema = 'public';
stmt = "\
SELECT column_name, data_type \
FROM information_schema.columns \
WHERE table_name = 'university_professors' AND table_schema = 'uni'; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 3/4
How many columns does the table university_professors have?
Possible Answers
Instructions 4/4
university_professors
table.-- Query the first five rows of our table
SELECT *
FROM university_professors
LIMIT 5;
stmt = "\
SELECT * \
FROM uni.university_professors \
LIMIT 5; \
"
pd.read_sql(stmt, conn)
CREATE
your first few TABLE
s¶You'll now start implementing a better database model. For this, you'll create tables for the professors
and universities
entity types. The other tables will be created for you.
The syntax for creating simple tables is as follows:
CREATE TABLE table_name (
column_a data_type,
column_b data_type,
column_c data_type
);
Attention: Table and columns names, as well as data types, don't need to be surrounded by quotation marks.
Instructions 1/2
professors
with two text
columns: firstname
and lastname
.-- Create a table for the professors entity type
CREATE TABLE professors (
firstname text,
lastname text
);
-- Print the contents of this table
SELECT *
FROM professors
stmt = "\
CREATE TABLE uni.professors ( \
firstname text, \
lastname text \
); \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.professors \
"
pd.read_sql(stmt, conn)
Instructions 2/2
universities
with three text columns: university_shortname
, university
, and university_city
.-- Create a table for the universities entity type
CREATE TABLE universities (
university_shortname text,
university text,
university_city text);
-- Print the contents of this table
SELECT *
FROM universities
stmt = "\
CREATE TABLE uni.universities ( \
university_shortname text, \
university text, \
university_city text); \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.universities \
"
pd.read_sql(stmt, conn)
ADD
a COLUMN
with ALTER TABLE
¶We forgot to add the university_shortname
column to the professors
table. You've probably already noticed:
In chapter 4 of this course, you'll need this column for connecting the professors
table with the universities
table.
However, adding columns to existing tables is easy, especially if they're still empty.
To add columns you can use the following SQL query:
ALTER TABLE table_name
ADD COLUMN column_name data_type;
Instructions
-- Add the university_shortname column
ALTER TABLE professors
ADD COLUMN university_shortname text;
-- Print the contents of this table
SELECT *
FROM professors
stmt = "\
ALTER TABLE uni.professors \
ADD COLUMN university_shortname text; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.professors \
"
pd.read_sql(stmt, conn)
Create the affiliations
table
CREATE TABLE affiliations (
firstname text,
lastname text,
university_shortname text,
function text,
organisation text
);
stmt = "\
CREATE TABLE uni.affiliations ( \
firstname text, \
lastname text, \
university_shortname text, \
function text, \
organisation text \
); \
"
conn.execute(stmt)
Create the organizations
table
CREATE TABLE organizations (
organization text,
organization_sector text
);
stmt = "\
CREATE TABLE uni.organizations ( \
organization text, \
organization_sector text \
); \
"
conn.execute(stmt)
RENAME
and DROP COLUMN
s in affiliations¶As mentioned in the video, the still empty affiliations table has some flaws. In this exercise, you'll correct them as outlined in the video.
You'll use the following queries:
To rename columns:
ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;
To delete columns:
ALTER TABLE table_name
DROP COLUMN column_name;
Instructions 1/2
organisation
column to organization
in affiliations
.-- Rename the organisation column
ALTER TABLE affiliations
RENAME COLUMN organisation TO organization;
stmt = "\
ALTER TABLE uni.affiliations \
RENAME COLUMN organisation TO organization; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.affiliations \
"
pd.read_sql(stmt, conn)
Instructions 2/2
university_shortname
column in affiliations
.-- Delete the university_shortname column
ALTER TABLE affiliations
DROP COLUMN university_shortname;
stmt = "\
ALTER TABLE uni.affiliations \
DROP COLUMN university_shortname; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.affiliations \
"
pd.read_sql(stmt, conn)
INSERT INTO SELECT DISTINCT
¶Now it's finally time to migrate the data into the new tables. You'll use the following pattern:
INSERT INTO ...
SELECT DISTINCT ...
FROM ...;
It can be broken up into two parts:
First part:
SELECT DISTINCT column_name1, column_name2, ...
FROM table_a;
This selects all distinct values in table table_a
- nothing new for you.
Second part:
INSERT INTO table_b ...;
Take this part and append it to the first, so it inserts all distinct rows from table_a
into table_b
.
One last thing: It is important that you run all of the code at the same time once you have filled out the blanks.
Instructions 1/2
DISTINCT
professors from university_professors
into professors
.professors
.-- Insert unique professors into the new table
INSERT INTO professors
SELECT DISTINCT firstname, lastname, university_shortname
FROM university_professors;
-- Doublecheck the contents of professors
SELECT *
FROM professors;
stmt = "\
INSERT INTO uni.professors \
SELECT DISTINCT firstname, lastname, university_shortname \
FROM uni.university_professors; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.professors \
"
pd.read_sql(stmt, conn)
Instructions 2/2
DISTINCT
affiliations into affiliations
from university_professors
.-- Insert unique affiliations into the new table
INSERT INTO affiliations
SELECT DISTINCT firstname, lastname, function, organization
FROM university_professors;
-- Doublecheck the contents of affiliations
SELECT *
FROM affiliations;
stmt = "\
INSERT INTO uni.affiliations \
SELECT DISTINCT firstname, lastname, function, organization \
FROM uni.university_professors; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.affiliations \
"
pd.read_sql(stmt, conn)
Migrate data to the universities
and organizations
tables*
stmt = "\
INSERT INTO uni.universities \
SELECT DISTINCT university_shortname, university, university_city \
FROM uni.university_professors; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.universities \
"
pd.read_sql(stmt, conn)
stmt = "\
INSERT INTO uni.organizations \
SELECT DISTINCT organization, organization_sector \
FROM uni.university_professors; \
"
conn.execute(stmt)
stmt = "\
SELECT * \
FROM uni.organizations \
"
pd.read_sql(stmt, conn)
DROP TABLE
¶The university_professors
table is now no longer needed and can safely be deleted.
For table deletion, you can use the simple command:
DROP TABLE table_name;
Instructions
-- Delete the university_professors table
DROP TABLE university_professors;
After building a simple database, it's now time to make use of the features. You'll specify data types in columns, enforce column uniqueness, and disallow NULL values in this chapter.
Integrity Constrains
Why should you know about constraints?
Data types as attribute constaints
Name | Aliases | Description |
---|---|---|
bigint | int8 | signed eight-byte integer |
bigserial | serial8 | autoincrementing eight-byte integer |
bit [ (n) ] | fixed-length bit string | |
bit varying [ (n) ] | varbit [ (n) ] | variable-length bit string |
boolean | bool | logical Boolean (true/false) |
box | rectangular box on a plane | |
bytea | binary data (“byte array”) | |
character [ (n) ] | char [ (n) ] | fixed-length character string |
character varying [ (n) ] | varchar [ (n) ] | variable-length character string |
cidr | IPv4 or IPv6 network address | |
circle | circle on a plane | |
date | calendar date (year, month, day) | |
double precision | float8 | double precision floating-point number (8 bytes) |
inet | IPv4 or IPv6 host address | |
integer | int, int4 | signed four-byte integer |
interval [ fields ] [ (p) ] | time span | |
json | textual JSON data | |
jsonb | binary JSON data, decomposed | |
line | infinite line on a plane | |
lseg | line segment on a plane | |
macaddr | MAC (Media Access Control) address | |
macaddr8 | MAC (Media Access Control) address (EUI-64 format) | |
money | currency amount | |
numeric [ (p, s) ] | decimal [ (p, s) ] | exact numeric of selectable precision |
path | geometric path on a plane | |
pg_lsn | PostgreSQL Log Sequence Number | |
point | geometric point on a plane | |
polygon | closed geometric path on a plane | |
real | float4 | single precision floating-point number (4 bytes) |
smallint | int2 | signed two-byte integer |
smallserial | serial2 | autoincrementing two-byte integer |
serial | serial4 | autoincrementing four-byte integer |
text | variable-length character string | |
time [ (p) ] [ without time zone ] | time of day (no time zone) | |
time [ (p) ] with time zone | timetz | time of day, including time zone |
timestamp [ (p) ] [ without time zone ] | date and time (no time zone) | |
timestamp [ (p) ] with time zone | timestamptz | date and time, including time zone |
tsquery | text search query | |
tsvector | text search document | |
txid_snapshot | user-level transaction ID snapshot | |
uuid | universally unique identifier | |
xml | XML data |
**Dealing with data types (casting)
SELECT temperature * wind_speed AS wind_chill FROM weather;
operator does not exist: integer * text HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
SELECT temperature * CAST(wind_speed AS integer) AS wind_chill FROM weather; ```
wind_speed
may store numbers, but PostgreSQL doesn't know how to use text in a calculation.CAST
function, followed by the column name, the AS
keyword, and the desired data type. PostgreSQL will run wind_speed
into an integer before the calculation.Which of the following is not used to enforce a database constraint?
Possible Answers
For demonstration purposes, I created a fictional database table that only holds three records. The columns have the data types date
, integer
, and text
, respectively.
CREATE TABLE transactions (
transaction_date date,
amount integer,
fee text
);
transaction_date | amount | fee |
---|---|---|
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
2018-09-24 | 5454 | 30 |
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
2018-09-24 | 5454 | 30 |
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
Have a look at the contents of the transactions
table.
The transaction_date
accepts date
values. According to the PostgreSQL documentation, it accepts values in the form of YYYY-MM-DD
, DD/MM/YY
, and so forth.
Both columns amount
and fee
appear to be numeric, however, the latter is modeled as text
- which you will account for in the next exercise.
Instructions
-- Let's add a record to the table
INSERT INTO transactions (transaction_date, amount, fee)
VALUES ('2018-24-09', 5454, '30');
-- Doublecheck the contents
SELECT *
FROM transactions;
date/time field value out of range: "2018-24-09"
LINE 3: VALUES ('2018-24-09', 5454, '30');
^
HINT: Perhaps you need a different "datestyle" setting.
-- Let's add a record to the table
INSERT INTO transactions (transaction_date, amount, fee)
VALUES ('2018-09-24', 5454, '30');
-- Doublecheck the contents
SELECT *
FROM transactions;
transaction_date | amount | fee |
---|---|---|
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
2018-09-24 | 5454 | 30 |
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
2018-09-24 | 5454 | 30 |
1999-01-08 | 500 | 20 |
2001-02-20 | 403 | 15 |
2001-03-20 | 3430 | 35 |
2018-09-24 | 5454 | 30 |
CAST
s¶In the video, you saw that type casts are a possible solution for data type issues. If you know that a certain column stores numbers as text
, you can cast the column to a numeric form, i.e. to integer
.
SELECT CAST(some_column AS integer)
FROM table;
Now, the some_column
column is temporarily represented as integer
instead of text
, meaning that you can perform numeric calculations on the column.
Instructions
integer
type cast at the right place and execute it again.-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + fee AS net_amount
FROM transactions;
operator does not exist: integer + text
LINE 2: SELECT transaction_date, amount + fee AS net_amount
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount
FROM transactions;
transaction_date | net_amount |
---|---|
1999-01-08 | 520 |
2001-02-20 | 418 |
2001-03-20 | 3465 |
2018-09-24 | 5484 |
THe most common data types in PostgreSQL
bigint
for larger numbersSpecifying type upon table creation
CREATE TABLE students (
ssn integer,
name varchar(64),
dob date,
average_grade numeric(3, 2), -- e.g. 5.54
tuition_paid boolean
);
Alter types after table creation
ALTER TABLE students
ALTER COLUMN name
TYPE varchar(128);
USING
keyword can be used to specify a transformation that should happen before the type is altered.ALTER TABLE students
ALTER COLUMN average_grade
TYPE integer
-- Turns 5.54 into 6, not 5, before type conversion
USING ROUND(average_grade);
USING
, have the number rounded to the nearest integer, for example.ALTER COLUMN
¶The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name
column in table_name
to varchar(10)
:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(10)
Now it's time to start adding constraints to your database.
Instructions 1/3
university_shortname
values in the professors
table and take note of the length of the strings.-- Select the university_shortname column
SELECT distinct(university_shortname)
FROM professors;
stmt = "\
SELECT distinct(university_shortname) \
FROM uni.professors; \
"
pd.read_sql(stmt, conn)
Instruction 2/3
university_shortname
.-- Specify the correct fixed-length character type
ALTER TABLE professors
ALTER COLUMN university_shortname
TYPE char(3);
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN university_shortname \
TYPE varchar(3); \
"
conn.execute(stmt)
Instructions 3/3
firstname
column to varchar(64)
.-- Change the type of firstname
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(64);
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN firstname \
TYPE varchar(64); \
"
conn.execute(stmt)
If you don't want to reserve too much space for a certain varchar
column, you can truncate the values before converting its type.
For this, you can use the following syntax:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(x)
USING SUBSTRING(column_name FROM 1 FOR x)
You should read it like this: Because you want to reserve only x
characters for column_name
, you have to retain a SUBSTRING
of every value, i.e. the first x
characters of it, and throw away the rest. This way, the values will fit the varchar(x)
requirement.
Instructions
-- Convert the values in firstname to a max. of 16 characters
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(16)
value too long for type character varying(16)
SUBSTRING()
to reduce firstname
to 16 characters so its type can be altered to varchar(16)
.-- Convert the values in firstname to a max. of 16 characters
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(16)
USING SUBSTRING(firstname FROM 1 FOR 16)
varchar(64)
. Now it's time to move on to the next set of attribute constraints!not null
constraint¶NULL
values in a column.NULL
valuesNULL
values in the futureNULL
meanCREATE TABLE students (
ssn integer not null,
lastname varchar(64) not null,
home_phone integer,
office_phone integer
);
NULL
values must not have the same meaningNULL
with NULL
always results in a FALSE
valuenot null
constraints can be added to or removed from existing tablesAdd not null to an existing table
ALTER TABLE students
ALTER COLUMN home_phone
SET NOT NULL;
Drop not null from an existing table
ALTER TABLE students
ALTER COLUMN ssn
DROP NOT NULL;
unique
constraint¶CREATE TABLE table_name (
column_name UNIQUE
);
Alter an existing table
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
NULL
values with SET NOT NULL
¶The professors
table is almost ready now. However, it still allows for NULL
s to be entered. Although some information might be missing about some professors, there's certainly columns that always need to be specified.
Instructions 1/2
firstname
column.-- Disallow NULL values in firstname
ALTER TABLE professors
ALTER COLUMN firstname SET NOT NULL;
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN firstname SET NOT NULL; \
"
conn.execute(stmt)
Instructions 2/2
lastname
column.-- Disallow NULL values in lastname
ALTER TABLE professors
ALTER COLUMN lastname SET NOT NULL;
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN lastname SET NOT NULL; \
"
conn.execute(stmt)
Good job - it is no longer possible to add professors which have either their first or last name set to NULL
. Likewise, it is no longer possible to update an existing professor and setting their first or last name to NULL
.
Execute the following statement:
INSERT INTO professors (firstname, lastname, university_shortname)
VALUES (NULL, 'Miller', 'ETH');
Why does this throw an error?
Possible Answers
NULL
is not put in quotes.As seen in the video, you add the UNIQUE
keyword after the column_name
that should be unique. This, of course, only works for new tables:
CREATE TABLE table_name (
column_name UNIQUE
);
If you want to add a unique constraint to an existing table, you do it like that:
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
Note that this is different from the ALTER COLUMN
syntax for the not-null constraint. Also, you have to give the constraint a name some_name
.
Instructions 1/2
university_shortname
column in universities
. Give it the name university_shortname_unq
.-- Make universities.university_shortname unique
ALTER TABLE universities
ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname);
stmt = "\
ALTER TABLE uni.universities \
ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname); \
"
conn.execute(stmt)
Instructions 2/2
organization
column in organizations
. Give it the name organization_unq
.-- Make organizations.organization unique
ALTER TABLE organizations
ADD CONSTRAINT organization_unq UNIQUE(organization);
stmt = "\
ALTER TABLE uni.organizations \
ADD CONSTRAINT organization_unq UNIQUE(organization); \
"
conn.execute(stmt)
Perfect. Making sure universities.university_shortname
and organizations.organization
only contain unique values is a prerequisite for turning them into so-called primary keys – the subject of the next chapter!
Now let's get into the best practices of database engineering. It's time to add primary and foreign keys to the tables. These are two of the most important concepts in databases, and are the building blocks you'll use to establish relationships between tables.
id
, will be added to three different tables (organizations
, professors
and universities
)| license_no | serial_no | make | model | year |
|--------------------|-----------|------------|---------|------|
| Texas ABC-739 | A69352 | Ford | Mustang | 2 |
| Florida TVP-347 | B43696 | Oldsmobile | Cutlass | 5 |
| New York MPO-22 | X83554 | Oldsmobile | Delta | 1 |
| California 432-TFY | C43742 | Mercedes | 190-D | 99 |
| California RSK-629 | Y82935 | Toyota | Camry | 4 |
| Texas RSK-629 | U028365 | Jaguar | XJS | 4 |
Superkeys
year
attribute from the superkey, the six records are still unique, so it's still a superkeyCandidate keys
year
from K4, make
would contain duplicates, and would, therefore, no longer be suited as a keySELECT COUNT DISTINCT
¶Your database doesn't have any defined keys so far, and you don't know which columns or combinations of columns are suited as keys.
There's a simple way of finding out whether a certain column (or a combination) contains only unique values - and thus identifies the records in the table.
You already know the SELECT DISTINCT
query from the first chapter. Now you just have to wrap everything within the COUNT()
function and PostgreSQL will return the number of unique rows for the given columns:
SELECT COUNT(DISTINCT(column_a, column_b, ...))
FROM table;
Instructions 1/2
universities
.-- Count the number of rows in universities
SELECT count(*)
FROM universities;
stmt = "\
SELECT count(*) \
FROM uni.universities; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 2/2
university_city
column.-- Count the number of distinct values in the university_city column
SELECT COUNT(DISTINCT(university_city))
FROM universities;
stmt = "\
SELECT COUNT(DISTINCT(university_city)) \
FROM uni.universities; \
"
res = conn.execute(stmt)
res.fetchall()
The university_city
column wouldn't lend itself as a key because there are only 9 distinct values, but the table has 11 rows.
SELECT COUNT DISTINCT
¶There's a very basic way of finding out what qualifies for a key in an existing, populated table:
Count the distinct records for all possible combinations of columns. If the resulting number x
equals the number of all rows in the table for a combination, you have discovered a superkey.
Then remove one column after another until you can no longer remove columns without seeing the number x
decrease. If that is the case, you have discovered a (candidate) key.
The table professors
has 551 rows. It has only one possible candidate key, which is a combination of two attributes. You might want to try different combinations using the "Run code" button. Once you have found the solution, you can submit your answer.
Instructions
attribs = ['firstname', 'lastname', 'university_shortname']
combos = sum([list(map(list, combinations(attribs, i))) for i in range(len(attribs) + 1)[1:]], [])
for x in combos:
att_str = ', '.join(x)
stmt = f"SELECT COUNT(DISTINCT({att_str})) FROM uni.professors;"
res = conn.execute(stmt)
print(att_str, res.fetchall())
The only combination that uniquely identifies professors
is {firstname, lastname}
. {firstname, lastname, university_shortname}
is a superkey, and all other combinations give duplicate values. Hopefully, the concept of superkeys and keys is now a bit more clear. Let's move on to primary keys!
CREATE TABLE products (
product_no integer UNIQUE NOT NULL,
name text,
price numeric
);
CREATE TABLE products (
product_no integer PRIMARY KEY,
name text,
price numeric
);
CREATE TABLE example (
a integer,
b integer,
c integer,
PRIMARY KEY (a, c)
);
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name)
universities
and organizations
.professors
in the last part of this chapter.Have a look at the example table from the previous video. As the database designer, you have to make a wise choice as to which column should be the primary key.
license_no | serial_no | make | model | year
--------------------+-----------+------------+---------+------
Texas ABC-739 | A69352 | Ford | Mustang | 2
Florida TVP-347 | B43696 | Oldsmobile | Cutlass | 5
New York MPO-22 | X83554 | Oldsmobile | Delta | 1
California 432-TFY | C43742 | Mercedes | 190-D | 99
California RSK-629 | Y82935 | Toyota | Camry | 4
Texas RSK-629 | U028365 | Jaguar | XJS | 4
Which of the following column or column combinations could best serve as primary key?
Possible Answers
license_no
is probably the wisest choice, as license numbers are certainly unique across all registered cars in a country.ADD
key CONSTRAINT
s to the tables¶Two of the tables in your database already have well-suited candidate keys consisting of one column each: organizations
and universities
with the organization
and university_shortname
columns, respectively.
In this exercise, you'll rename these columns to id
using the RENAME COLUMN
command and then specify primary key constraints for them. This is as straightforward as adding unique constraints (see the last exercise of Chapter 2):
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name)
Note that you can also specify more than one column in the brackets.
Instructions 1/2
organization
column to id
in organizations
.id
a primary key and name it organization_pk
.-- Rename the organization column to id
ALTER TABLE organizations
RENAME COLUMN organization TO id;
-- Make id a primary key
ALTER TABLE organizations
ADD CONSTRAINT organization_pk PRIMARY KEY (id);
stmt = "\
ALTER TABLE uni.organizations \
RENAME COLUMN organization TO id; \
"
conn.execute(stmt)
stmt = "\
ALTER TABLE uni.organizations \
ADD CONSTRAINT organization_pk PRIMARY KEY (id); \
"
conn.execute(stmt)
Instructions 2/2
university_shortname
column to id
in universities
.id
a primary key and name it university_pk
.-- Rename the university_shortname column to id
ALTER TABLE universities
RENAME COLUMN university_shortname TO id;
-- Make id a primary key
ALTER TABLE universities
ADD CONSTRAINT university_pk PRIMARY KEY (id);
stmt = "\
ALTER TABLE uni.universities \
RENAME COLUMN university_shortname TO id; \
"
conn.execute(stmt)
stmt = "\
ALTER TABLE uni.universities \
ADD CONSTRAINT university_pk PRIMARY KEY (id); \
"
conn.execute(stmt)
| license_no | serial_no | make | model | color |
|--------------------|-----------|------------|---------|-----------|
| Texas ABC-739 | A69352 | Ford | Mustang | blue |
| Florida TVP-347 | B43696 | Oldsmobile | Cutlass | black |
| New York MPO-22 | X83554 | Oldsmobile | Delta | silver |
| California 432-TFY | C43742 | Mercedes | 190-D | champagne |
| California RSK-629 | Y82935 | Toyota | Camry | red |
| Texas RSK-629 | U028365 | Jaguar | XJS | blue |
| make | model | color |
|------------|---------|-----------|
| Ford | Mustang | blue |
| Oldsmobile | Cutlass | black |
| Oldsmobile | Delta | silver |
| Mercedes | 190-D | champagne |
| Toyota | Camry | red |
| Jaguar | XJS | blue |
ALTER TABLE cars
ADD COLUMN id serial PRIMARY KEY;
INSERT INTO cars
VALUES ('Volkswagen', 'Blitz', 'black');
| make | model | color | id |
|------------|---------|-----------|----|
| Ford | Mustang | blue | 1 |
| Oldsmobile | Cutlass | black | 2 |
| Oldsmobile | Delta | silver | 3 |
| Mercedes | 190-D | champagne | 4 |
| Toyota | Camry | red | 5 |
| Jaguar | XJS | blue | 6 |
| Volkswaven | Blitz | black | 7 |
INSERT INTO cars
VALUES ('Opel', 'Astra', 'green', 1);
_duplicate key value violates unique constraint "id_pkey"DETAIL: Key (id)=(1) already exists._
Another type of surrogate key
ALTER TABLE table_name
ADD COLUMN column_c varchar(256);
UPDATE table_name
SET column_c = CONCAT(column_a, column_b);
ALTER TABLE table_name
ADD CONSTRAINT pk PRIMARY KEY (column_c);
SERIAL
surrogate key¶Since there's no single column candidate key in professors
(only a composite key candidate consisting of firstname
, lastname
), you'll add a new column id
to that table.
This column has a special data type serial
, which turns the column into an auto-incrementing number. This means that, whenever you add a new professor to the table, it will automatically get an id
that does not exist yet in the table: a perfect primary key!
Instructions 1/3
id
with data type serial
to the professors
table.-- Add the new column to the table
ALTER TABLE professors
ADD COLUMN id serial;
stmt = "\
ALTER TABLE uni.professors \
ADD COLUMN id serial; \
"
conn.execute(stmt)
Instructions 2/3
Make id
a primary key and name it professors_pkey
.
-- Make id a primary key
ALTER TABLE professors
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);
stmt = "\
ALTER TABLE uni.professors \
ADD CONSTRAINT professors_pkey PRIMARY KEY (id); \
"
conn.execute(stmt)
Instructions 3/3
Write a query that returns all the columns and 10 rows from professors
.
SELECT * FROM professors
LIMIT 10;
stmt = "\
SELECT * FROM uni.professors \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
As you can see, PostgreSQL has automatically numbered the rows with the id
column, which now functions as a (surrogate) primary key – it uniquely identifies professors.
CONCAT
enate columns to a surrogate key¶Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT()
function.
Let's think of the following example table:
CREATE TABLE cars (
make varchar(64) NOT NULL,
model varchar(64) NOT NULL,
mpg integer NOT NULL
)
The table is populated with 10 rows of completely fictional data.
Unfortunately, the table doesn't have a primary key yet. None of the columns consists of only unique values, so some columns can be combined to form a key.
In the course of the following exercises, you will combine make
and model
into such a surrogate key.
Instructions 1/4
make
and model
columns.-- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model))
FROM cars;
# Create a new schema and table
conn.execute('CREATE SCHEMA auto;')
conn.execute(' \
CREATE TABLE auto.cars ( \
"make" varchar(64) NOT NULL, \
"model" varchar(64) NOT NULL, \
"mpg" integer NOT NULL \
);')
data = {'make': ['Subaru', 'Opel', 'Opel', 'Ford', 'Ford', 'Toyota', 'Toyota', 'Toyota', 'Mitsubishi', 'Mitsubishi'],
'model': ['Forester', 'Astra', 'Vectra', 'Avenger', 'Galaxy', 'Prius', 'Speedster', 'Galaxy', 'Forester', 'Galaxy'],
'mpg': [24, 45, 40, 30, 30, 50, 30, 20, 10, 30]}
cars = pd.DataFrame(data)
cars.to_sql('cars', con=engine, schema='auto', index=False, if_exists='replace')
stmt = "\
SELECT COUNT(DISTINCT(make, model)) \
FROM auto.cars; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 2/4
Add a new column id
with the data type varchar(128)
.
-- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);
stmt = "\
ALTER TABLE auto.cars \
ADD COLUMN id varchar(128); \
"
conn.execute(stmt)
Instructions 3/4
Concatenate make
and model
into id
using an UPDATE table_name SET column_name = ...
query and the CONCAT()
function.
-- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);
stmt = "\
UPDATE auto.cars \
SET id = CONCAT(make, model); \
"
conn.execute(stmt)
Instructions 4/4
Make id
a primary key and name it id_pk
.
-- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);
-- Have a look at the table
SELECT * FROM cars;
stmt = "\
ALTER TABLE auto.cars \
ADD CONSTRAINT id_pk PRIMARY KEY(id); \
"
conn.execute(stmt)
res = conn.execute("SELECT * FROM auto.cars")
res.fetchall()
Before you move on to the next chapter, let's quickly review what you've learned so far about attributes and key constraints. If you're unsure about the answer, please quickly review chapters 2 and 3, respectively.
Let's think of an entity type "student". A student has:
Instructions
students
with the correct column types.PRIMARY KEY
for the social security number ssn
.-- Create the table
CREATE TABLE students (
last_name char(128) NOT NULL,
ssn integer[9] UNIQUE,
phone_no varchar(12),
PRIMARY KEY (ssn)
);
conn.execute(' \
CREATE TABLE uni.students ( \
"last_name" char(128) NOT NULL, \
"ssn" integer[9] UNIQUE, \
"phone_no" varchar(12), \
PRIMARY KEY (ssn) \
);')
In the final chapter, you'll leverage foreign keys to connect tables and establish relationships that will greatly benefit your data quality. And you'll run ad hoc analyses on your new database.
Current DB model
Next DB model
SELECT * FROM professors LIMIT 8;
| id | firstname | lastname | university_shortname |
|----|-----------------|-------------|----------------------|
| 1 | Karl | Aberer | EPF |
| 2 | Reza Shokrollah | Abhari | ETH |
| 3 | Georges | Abou Jaoudé | EPF |
| 4 | Hugues | Abriel | UBE |
| 5 | Daniel | Aebersold | UBE |
| 6 | Marcelo | Aebi | ULA |
| 7 | Christoph | Aebi | UBE |
| 8 | Patrick | Aebischer | EPF |
SELECT * FROM universities;
| id | university | university_city |
|-----|----------------|-----------------|
| EPF | ETH Lausanne | Lausanne |
| ETH | ETH Zürich | Zurich |
| UBA | Uni Basel | Basel |
| UBE | Uni Bern | Bern |
| UFR | Uni Freiburg | Fribourg |
| UGE | Uni Genf | Geneva |
| ULA | Uni Lausanne | Lausanne |
| UNE | Uni Neuenburg | Neuchâtel |
| USG | Uni St. Gallen | Saint Gallen |
| USI | USI Lugano | Lugano |
| UZH | Uni Zürich | Zurich |
Specifying foreign keys
CREATE TABLE manufacturers (
name varchar(255) PRIMARY KEY
);
INSERT INTO manufacturers
VALUES ('Ford'), ('VW'), ('GM');
CREATE TABLE cars (
model varchar(255) PRIMARY KEY,
manufacturer_name integer REFERENCES manufacturers (name)
);
INSERT INTO cars
VALUES ('Ranger', 'Ford'), ('Beetle', 'VW');
-- Throws an error!
INSERT INTO cars
VALUES ('Tundra', 'Toyota');
Specifying foreign keys to existing tables
ALTER TABLE a
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);
In your database, you want the professors
table to reference the universities
table. You can do that by specifying a column in professors
table that references a column in the universities
table.
As just shown in the video, the syntax for that looks like this:
ALTER TABLE a
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);
Table a
should now refer to table b
, via b_id
, which points to id
. a_fkey
is, as usual, a constraint name you can choose on your own.
Pay attention to the naming convention employed here: Usually, a foreign key referencing another primary key with name id
is named x_id
, where x
is the name of the referencing table in the singular form.
Instructions 1/2
university_shortname
column to university_id
in professors
.-- Rename the university_shortname column
ALTER TABLE professors
RENAME COLUMN university_shortname to university_id;
stmt = "\
ALTER TABLE uni.professors \
RENAME COLUMN university_shortname to university_id; \
"
conn.execute(stmt)
Instructions 2/2
university_id
column in professors
that references the id
column in universities
.professors_fkey
.-- Add a foreign key on professors referencing universities
ALTER TABLE professors
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id);
stmt = "\
ALTER TABLE uni.professors \
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES uni.universities (id); \
"
conn.execute(stmt)
Now, the professors
table has a link to the universities
table. Each professor belongs to exactly one university.
Foreign key constraints help you to keep order in your database mini-world. In your database, for instance, only professors belonging to Swiss universities should be allowed, as only Swiss universities are part of the universities
table.
The foreign key on professors
referencing universities you just created thus makes sure that only existing universities
can be specified when inserting new data. Let's test this!
Instructions
-- Try to insert a new professor
INSERT INTO professors (firstname, lastname, university_id)
VALUES ('Albert', 'Einstein', 'MIT');
insert or update on table "professors" violates foreign key constraint "professors_fkey"
DETAIL: Key (university_id)=(MIT) is not present in table "universities".
university_id
so that it actually reflects where Albert Einstein wrote his dissertation and became a professor – at the University of Zurich (UZH)!-- Try to insert a new professor
INSERT INTO professors (firstname, lastname, university_id)
VALUES ('Albert', 'Einstein', 'UZH');
stmt = "\
INSERT INTO uni.professors (firstname, lastname, university_id) \
VALUES ('Albert', 'Einstein', 'MIT'); \
"
conn.execute(stmt)
stmt = "\
INSERT INTO uni.professors (firstname, lastname, university_id) \
VALUES ('Albert', 'Einstein', 'UZH'); \
"
conn.execute(stmt)
As you can see, inserting a professor with non-existing university IDs violates the foreign key constraint you've just added. This also makes sure that all universities are spelled equally – adding to data consistency.
Let's join these two tables to analyze the data further!
You might already know how SQL joins work from the Intro to SQL for Data Science course (last exercise) or from Joining Data in PostgreSQL.
Here's a quick recap on how joins generally work:
SELECT ...
FROM table_a
JOIN table_b
ON ...
WHERE ...
While foreign keys and primary keys are not strictly necessary for join queries, they greatly help by telling you what to expect. For instance, you can be sure that records referenced from table A will always be present in table B – so a join from table A will always find something in table B. If not, the foreign key constraint would be violated.
Instructions
JOIN
professors
with universities
on professors.university_id = universities.id
, i.e., retain all records where the foreign key of professor
s is equal to the primary key of universities
.university_city = 'Zurich'
.-- Select all professors working for universities in the city of Zurich
SELECT professors.lastname, universities.id, universities.university_city
FROM professors
JOIN universities
ON professors.university_id = universities.id
WHERE universities.university_city = 'Zurich';
stmt = "\
SELECT professors.lastname, universities.id, universities.university_city \
FROM uni.professors \
JOIN uni.universities \
ON professors.university_id = universities.id \
WHERE universities.university_city = 'Zurich'; \
"
res = conn.execute(stmt)
res.fetchall()
First, the university belonging to each professor was attached with the JOIN
operation. Then, only professors having "Zurich" as university city were retained with the WHERE
clause.
How to implement N:M-relationships
CREATE TABLE affiliations (
professor_id integer REFERENCES professors (id),
organization_id varchar(256) REFERENCES organization (id),
function varchar(256)
);
Time to implement this
At the moment, the affiliations table has the structure {firstname
, lastname
, function
, organization
}, as you can see in the preview at the bottom right. In the next three exercises, you're going to turn this table into the form {professor_id
, organization_id
, function
}, with professor_id and organization_id being foreign keys that point to the respective tables.
You're going to transform the affiliations table in-place, i.e., without creating a temporary table to cache your intermediate results.
Instructions 1/3
professor_id
column with integer
data type to affiliations
, and declare it to be a foreign key that references the id
column in professors
.-- Add a professor_id column
ALTER TABLE affiliations
ADD COLUMN professor_id integer REFERENCES professors (id);
stmt = "\
ALTER TABLE uni.affiliations \
ADD COLUMN professor_id integer REFERENCES uni.professors (id); \
"
conn.execute(stmt)
Instructions 2/3
Rename the organization
column in affiliations
to organization_id
.
-- Rename the organization column to organization_id
ALTER TABLE affiliations
RENAME COLUMN organization TO organization_id;
stmt = "\
ALTER TABLE uni.affiliations \
RENAME COLUMN organization TO organization_id; \
"
conn.execute(stmt)
Instructions 3/3
Add a foreign key constraint on organization_id
so that it references the id
column in organizations
.
-- Add a foreign key on organization_id
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id);
stmt = "\
ALTER TABLE uni.affiliations \
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES uni.organizations (id); \
"
conn.execute(stmt)
Making organization_id
a foreign key worked flawlessly because these organizations actually exist in the organizations table
. That was only the first part, though. Now it's time to update professor_id
in affiliations
– so that it correctly refers to the corresponding professors.
Now it's time to also populate professors_id
. You'll take the ID directly from professors
.
Here's a way to update columns of a table based on values in another table:
UPDATE table_a
SET column_to_update = table_b.column_to_update_from
FROM table_b
WHERE condition1 AND condition2 AND ...;
This query does the following:
table_a
, find the corresponding row in table_b
where condition1
, condition2
, etc., are met.column_to_update
to the value of column_to_update_from
(from that corresponding row).The conditions usually compare other columns of both tables, e.g. table_a.some_column = table_b.some_column
. Of course, this query only makes sense if there is only one matching row in table_b
.
Instructions 1/3
affiliations
by fetching 10 rows and all columns.-- Have a look at the 10 first rows of affiliations
SELECT * FROM affiliations
LIMIT 10;
stmt = "\
SELECT * FROM uni.affiliations \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 2/3
professor_id
column with the corresponding value of the id
column in professors
.professors
where the firstname
and lastname
are identical to the ones in affiliations
.-- Set professor_id to professors.id where firstname, lastname correspond to rows in professors
UPDATE affiliations
SET professor_id = professors.id
FROM professors
WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname;
stmt = "\
UPDATE uni.affiliations \
SET professor_id = professors.id \
FROM uni.professors \
WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname; \
"
conn.execute(stmt)
Instructions 3/3
affiliations
again. Have the professor_ids
been correctly matched?-- Have a look at the 10 first rows of affiliations again
SELECT * FROM affiliations
LIMIT 10;
stmt = "\
SELECT * FROM uni.affiliations \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
As you can see, the correct professors.id
has been inserted into professor_id
for each record, thanks to the matching firstname
and lastname
in both tables.
The firstname and lastname columns of affiliations were used to establish a link to the professors table in the last exercise – so the appropriate professor IDs could be copied over. This only worked because there is exactly one corresponding professor for each row in affiliations. In other words: {firstname, lastname} is a candidate key of professors – a unique combination of columns.
It isn't one in affiliations though, because, as said in the video, professors can have more than one affiliation.
Because professors are referenced by professor_id now, the firstname and lastname columns are no longer needed, so it's time to drop them. After all, one of the goals of a database is to reduce redundancy where possible.
Instructions
firstname
and lastname
columns from the affiliations
table.-- Drop the firstname & lastname column
ALTER TABLE affiliations
DROP COLUMN firstname;
ALTER TABLE affiliations
DROP COLUMN lastname;
stmt = "\
ALTER TABLE uni.affiliations \
DROP COLUMN firstname, \
DROP COLUMN lastname; \
"
conn.execute(stmt)
Now the affiliations
table that models the N:M-relationship between professors
and organizations
is finally complete.
referential integrity violations
Dealing with violations
CREATE TABLE a (
id integer PRIMARY KEY,
column_a varchar(64),
...,
b_id integer REFERENCES b (id) ON DELETE NO ACTION
);
Other options
CREATE TABLE a (
id integer PRIMARY KEY,
column_a varchar(64),
...,
b_id integer REFERENCES b (id) ON DELETE CASCADE
);
Given the current state of your database, what happens if you execute the following SQL statement?
DELETE FROM universities WHERE id = 'EPF';
Possible Answers
You defined a foreign key on professors.university_id
that references universities.id
, so referential integrity is said to hold from professors
to universities
.
So far, you implemented three foreign key constraints:
professors.university_id
to universities.id
affiliations.organization_id
to organizations.id
affiliations.professor_id
to professors.id
These foreign keys currently have the behavior ON DELETE NO ACTION
. Here, you're going to change that behavior for the column referencing organizations
from affiliations
. If an organization is deleted, all its affiliations (by any professor) should also be deleted.
Altering a key constraint doesn't work with ALTER COLUMN
. Instead, you have to delete the key constraint and then add a new one with a different ON DELETE
behavior.
For deleting constraints, though, you need to know their name. This information is also stored in information_schema
.
Instructions 1/4
table_constraint
s in information_schema
.-- Identify the correct constraint name
SELECT constraint_name, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';
stmt = "\
SELECT constraint_name, table_name, constraint_type \
FROM information_schema.table_constraints \
WHERE constraint_type = 'FOREIGN KEY'; \
"
res = conn.execute(stmt)
res.fetchall()
Instructions 2/4
affiliations_organization_id_fkey
foreign key constraint in affiliations
.-- Drop the right foreign key constraint
ALTER TABLE affiliations
DROP CONSTRAINT affiliations_organization_id_fkey;
stmt = "\
ALTER TABLE uni.affiliations \
DROP CONSTRAINT affiliations_organization_fkey; \
"
conn.execute(stmt)
Instructions 3/4
affiliations
that cascades deletion if a referenced record is deleted from organizations
. Name it affiliations_organization_id_fkey
.-- Add a new foreign key constraint from affiliations to organizations which cascades deletion
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE;
stmt = "\
ALTER TABLE uni.affiliations \
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES uni.organizations (id) ON DELETE CASCADE; \
"
conn.execute(stmt)
Instructions 4/4
DELETE
and SELECT
queries to double check that the deletion cascade actually works.-- Delete an organization
DELETE FROM organizations
WHERE id = 'CUREM';
-- Check that no more affiliations with this organization exist
SELECT * FROM affiliations
WHERE organization_id = 'CUREM';
stmt = "\
DELETE FROM uni.organizations \
WHERE id = 'CUREM'; \
"
conn.execute(stmt)
stmt = "\
SELECT * FROM uni.affiliations \
WHERE organization_id = 'CUREM'; \
"
res = conn.execute(stmt)
res.fetchall()
As you can see, whenever an organization referenced by an affiliation is deleted, the affiliations also gets deleted. It is your job as database designer to judge whether this is a sensible configuration. Sometimes, setting values to NULL
or to restrict deletion altogether might make more sense!
How the database has been transformed
The database ecosystem
Now that your data is ready for analysis, let's run some exemplary SQL queries on the database. You'll now use already known concepts such as grouping by columns and joining tables.
In this exercise, you will find out which university has the most affiliations (through its professors). For that, you need both affiliations
and professors
tables, as the latter also holds the university_id
.
As a quick repetition, remember that joins have the following structure:
SELECT table_a.column1, table_a.column2, table_b.column1, ...
FROM table_a
JOIN table_b
ON table_a.column = table_b.column
This results in a combination of table_a
and table_b
, but only with rows where table_a.column
is equal to table_b.column
.
Instructions
-- Count the total number of affiliations per university
SELECT COUNT(*), professors.university_id
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
-- Group by the ids of professors
GROUP BY professors.id
ORDER BY count DESC;
stmt = "\
SELECT COUNT(*), professors.university_id \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
GROUP BY professors.university_id \
ORDER BY count DESC LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
As you can see, the count of affiliations is completely different from university to university.
In this last exercise, you will find the university city of the professor with the most affiliations in the sector "Media & communication".
For this, you need to join all the tables, group by a column, and then use selection criteria to get only the rows in the correct sector.
Instructions 1/3
affiliations
, professors
, organizations
, and universities
) and look at the result.-- Join all tables
SELECT *
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id;
stmt = "\
SELECT * \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
res = conn.execute(stmt).fetchall()
# Create a DataFrame from the results: df
df = pd.DataFrame(res)
# Set column names
df.columns = res[0].keys()
df.head()
Instructions 2/3
-- Group the table by organization sector, professor and university city
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
GROUP BY organizations.organization_sector, professors.id, universities.university_city;
stmt = "\
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
GROUP BY organizations.organization_sector, professors.id, universities.university_city \
LIMIT 10; \
"
res = conn.execute(stmt).fetchall()
df = pd.DataFrame(res)
df.columns = res[0].keys()
df
Instructions 3/3
-- Filter the table and sort it
SELECT COUNT(*), organizations.organization_sector,
professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
WHERE organizations.organization_sector = 'Media & communication'
GROUP BY organizations.organization_sector,
professors.id, universities.university_city
ORDER BY count DESC;
stmt = "\
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
WHERE organizations.organization_sector = 'Media & communication' \
GROUP BY organizations.organization_sector, professors.id, universities.university_city \
ORDER BY count DESC LIMIT 10; \
"
res = conn.execute(stmt).fetchall()
df = pd.DataFrame(res)
df.columns = res[0].keys()
df
The professor with id
538 (different for the downloaded dataset) has the most affiliations in the "Media & communication" sector, and he or she lives in the city of Lausanne. Thanks to your database design, you can be sure that the data you've just queried is consistent. Of course, you could also put university_city
and organization_sector
in their own tables, making the data model even more formal. However, in database design, you have to strike a balance between modeling overhead, desired data consistency, and usability for queries like the one you've just wrote. Congratulations, you made it to the end!
conn.close()