Course Description

You’ve already used SQL to query data from databases. But did you know that there's a lot more you can do with databases? You can model different phenomena in your data, as well as the relationships between them. This gives your data structure and consistency, which results in better data quality. In this course, you'll experience this firsthand by working with a real-life dataset that was used to investigate questionable university affiliations. Column by column, table by table, you'll get to unlock and admire the full potential of databases. You'll learn how to create tables and specify their relationships, as well as how to enforce data integrity. You'll also discover other unique features of database systems, such as constraints.

Datasets

Imports

In [1]:
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table
from sqlalchemy import Column
from sqlalchemy import Integer, String
from sqlalchemy import inspect
from sqlalchemy.types import Text
import pandas as pd
from pprint import pprint as pp
from itertools import combinations

Pandas Configuration Options

In [2]:
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

PostgreSQL Connection

  • In order to run this Notebook, install, setup and configure a PostgreSQL database with the previously mentioned datasets.
  • Edit engine to use your database username and password.
In [3]:
engine = create_engine('postgresql://postgres:postgres@localhost/postgres')
In [4]:
conn = engine.connect()

Create the university_professors table

Create an empty table with conn

In [5]:
conn.execute('CREATE SCHEMA uni;')
conn.execute(' \
CREATE TABLE uni.university_professors ( \
  "firstname" text, \
  "lastname" text, \
  "university" text, \
  "university_shortname" text, \
  "university_city" text, \
  "function" text, \
  "organization" text, \
  "organization_sector" text \
);')
Out[5]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0b0359c8>

Create the filled table from pandas

In [6]:
uni_prof = pd.read_excel('data/2019-12-13_intro_to_relational_databases/university_professors.xlsx')
In [7]:
uni_prof.head()
Out[7]:
firstname lastname university university_shortname university_city function organization organization_sector
0 Karl Aberer ETH Lausanne EPF Lausanne Chairman of L3S Advisory Board L3S Advisory Board Education & research
1 Karl Aberer ETH Lausanne EPF Lausanne Member Conseil of Zeno-Karl Schindler Foundation Zeno-Karl Schindler Foundation Education & research
2 Karl Aberer ETH Lausanne EPF Lausanne Member of Conseil Fondation IDIAP Fondation IDIAP Education & research
3 Karl Aberer ETH Lausanne EPF Lausanne Panel Member SNF Ambizione Program Education & research
4 Reza Shokrollah Abhari ETH Zürich ETH Zurich Aufsichtsratsmandat PNE Wind AG Energy, environment & mobility
In [8]:
_dtypes = {'firstname': Text(), 'lastname': Text(), 'university': Text(), 'university_shortname': Text(),
           'university_city': Text(), 'function': Text(), 'organization': Text(), 'organization_sector': Text()}
In [9]:
uni_prof.to_sql('university_professors', con=engine, schema='uni', index=False, if_exists='replace', dtype=_dtypes)

Your first database

In this chapter, you'll create your very first database with a set of simple SQL commands. Next, you'll migrate data from existing flat tables into that database. You'll also learn how meta-information about a database can be queried.

Introduction to relational databases

Attributes of relational databases

In the video, we talked about some basic facts about relational databases. Which of the following statements does not hold true for databases? Relational databases ...

Possible Answers

  • ... store different real-world entities in different tables.
  • ... allow to establish relationships between entities.
  • ... are called "relational" because they store data only about people.
  • ... use constraints, keys and referential integrity in order to assure data quality.

Query information_schema with SELECT

information_schema is a meta-database that holds information about your current database. information_schema has multiple tables you can query with the known SELECT * FROM syntax:

  • tables: information about all tables in your current database
  • columns: information about all columns in all of the tables in your current database
  • ...

In this exercise, you'll only need information from the 'public' schema, which is specified as the column table_schema of the tables and columns tables. The 'public' schema holds information about user-defined tables and databases. The other types of table_schema hold system information - for this course, you're only interested in user-defined stuff.

Instructions 1/4

  • Get information on all table names in the current database, while limiting your query to the 'public' table_schema.
-- Query the right table in information_schema
SELECT table_name 
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
In [10]:
stmt = "\
SELECT table_name \
FROM information_schema.tables \
WHERE table_schema = 'uni'; \
"
res = conn.execute(stmt)
res.fetchall()
Out[10]:
[('university_professors',)]

Instructions 2/4

  • Now have a look at the columns in university_professors by selecting all entries in information_schema.columns that correspond to that table.
-- Query the right table in information_schema to get columns
SELECT column_name, data_type 
FROM information_schema.columns 
WHERE table_name = 'university_professors' AND table_schema = 'public';
In [11]:
stmt = "\
SELECT column_name, data_type \
FROM information_schema.columns \
WHERE table_name = 'university_professors' AND table_schema = 'uni'; \
"
res = conn.execute(stmt)
res.fetchall()
Out[11]:
[('firstname', 'text'),
 ('lastname', 'text'),
 ('university', 'text'),
 ('university_shortname', 'text'),
 ('university_city', 'text'),
 ('function', 'text'),
 ('organization', 'text'),
 ('organization_sector', 'text')]

Instructions 3/4

How many columns does the table university_professors have?

Possible Answers

  • 12
  • 9
  • 8
  • 5

Instructions 4/4

  • Finally, print the first five rows of the university_professors table.
-- Query the first five rows of our table
SELECT * 
FROM university_professors 
LIMIT 5;
In [12]:
stmt = "\
SELECT * \
FROM uni.university_professors \
LIMIT 5; \
"

pd.read_sql(stmt, conn)
Out[12]:
firstname lastname university university_shortname university_city function organization organization_sector
0 Karl Aberer ETH Lausanne EPF Lausanne Chairman of L3S Advisory Board L3S Advisory Board Education & research
1 Karl Aberer ETH Lausanne EPF Lausanne Member Conseil of Zeno-Karl Schindler Foundation Zeno-Karl Schindler Foundation Education & research
2 Karl Aberer ETH Lausanne EPF Lausanne Member of Conseil Fondation IDIAP Fondation IDIAP Education & research
3 Karl Aberer ETH Lausanne EPF Lausanne Panel Member SNF Ambizione Program Education & research
4 Reza Shokrollah Abhari ETH Zürich ETH Zurich Aufsichtsratsmandat PNE Wind AG Energy, environment & mobility

Tables: At the core of every database

CREATE your first few TABLEs

You'll now start implementing a better database model. For this, you'll create tables for the professors and universities entity types. The other tables will be created for you.

The syntax for creating simple tables is as follows:

CREATE TABLE table_name (
 column_a data_type,
 column_b data_type,
 column_c data_type
);

Attention: Table and columns names, as well as data types, don't need to be surrounded by quotation marks.

Instructions 1/2

  • Create a table professors with two text columns: firstname and lastname.
-- Create a table for the professors entity type
CREATE TABLE professors (
 firstname text,
 lastname text
);

-- Print the contents of this table
SELECT * 
FROM professors
In [13]:
stmt = "\
CREATE TABLE uni.professors ( \
 firstname text, \
 lastname text \
); \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.professors \
"

pd.read_sql(stmt, conn)
Out[13]:
firstname lastname

Instructions 2/2

  • Create a table universities with three text columns: university_shortname, university, and university_city.
-- Create a table for the universities entity type
CREATE TABLE universities (
university_shortname text,
university text,
university_city text);

-- Print the contents of this table
SELECT * 
FROM universities
In [14]:
stmt = "\
CREATE TABLE uni.universities ( \
university_shortname text, \
university text, \
university_city text); \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.universities \
"

pd.read_sql(stmt, conn)
Out[14]:
university_shortname university university_city

ADD a COLUMN with ALTER TABLE

We forgot to add the university_shortname column to the professors table. You've probably already noticed:

In chapter 4 of this course, you'll need this column for connecting the professors table with the universities table.

However, adding columns to existing tables is easy, especially if they're still empty.

To add columns you can use the following SQL query:

ALTER TABLE table_name
ADD COLUMN column_name data_type;

Instructions

  • Alter professors to add the text column university_shortname.
-- Add the university_shortname column
ALTER TABLE professors
ADD COLUMN university_shortname text;

-- Print the contents of this table
SELECT * 
FROM professors
In [15]:
stmt = "\
ALTER TABLE uni.professors \
ADD COLUMN university_shortname text; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.professors \
"

pd.read_sql(stmt, conn)
Out[15]:
firstname lastname university_shortname

Update your database as the structure changes

Create the affiliations table

CREATE TABLE affiliations (
 firstname text,
 lastname text,
 university_shortname text,
 function text,
 organisation text
);
In [16]:
stmt = "\
CREATE TABLE uni.affiliations ( \
 firstname text, \
 lastname text, \
 university_shortname text, \
 function text, \
 organisation text \
); \
"
conn.execute(stmt)
Out[16]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e963308>

Create the organizations table

CREATE TABLE organizations (
 organization text,
 organization_sector text
);
In [17]:
stmt = "\
CREATE TABLE uni.organizations ( \
 organization text, \
 organization_sector text \
); \
"
conn.execute(stmt)
Out[17]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e948288>

RENAME and DROP COLUMNs in affiliations

As mentioned in the video, the still empty affiliations table has some flaws. In this exercise, you'll correct them as outlined in the video.

You'll use the following queries:

To rename columns:

ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;

To delete columns:

ALTER TABLE table_name
DROP COLUMN column_name;

Instructions 1/2

  • Rename the organisation column to organization in affiliations.
-- Rename the organisation column
ALTER TABLE affiliations
RENAME COLUMN organisation TO organization;
In [18]:
stmt = "\
ALTER TABLE uni.affiliations \
RENAME COLUMN organisation TO organization; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.affiliations \
"

pd.read_sql(stmt, conn)
Out[18]:
firstname lastname university_shortname function organization

Instructions 2/2

  • Delete the university_shortname column in affiliations.
-- Delete the university_shortname column
ALTER TABLE affiliations
DROP COLUMN university_shortname;
In [19]:
stmt = "\
ALTER TABLE uni.affiliations \
DROP COLUMN university_shortname; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.affiliations \
"

pd.read_sql(stmt, conn)
Out[19]:
firstname lastname function organization

Migrate date with INSERT INTO SELECT DISTINCT

Now it's finally time to migrate the data into the new tables. You'll use the following pattern:

INSERT INTO ... 
SELECT DISTINCT ... 
FROM ...;

It can be broken up into two parts:

First part:

SELECT DISTINCT column_name1, column_name2, ... 
FROM table_a;

This selects all distinct values in table table_a - nothing new for you.

Second part:

INSERT INTO table_b ...;

Take this part and append it to the first, so it inserts all distinct rows from table_a into table_b.

One last thing: It is important that you run all of the code at the same time once you have filled out the blanks.

Instructions 1/2

  • Insert all DISTINCT professors from university_professors into professors.
  • Print all the rows in professors.
-- Insert unique professors into the new table
INSERT INTO professors 
SELECT DISTINCT firstname, lastname, university_shortname 
FROM university_professors;

-- Doublecheck the contents of professors
SELECT * 
FROM professors;
In [20]:
stmt = "\
INSERT INTO uni.professors \
SELECT DISTINCT firstname, lastname, university_shortname \
FROM uni.university_professors; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.professors \
"

pd.read_sql(stmt, conn)
Out[20]:
firstname lastname university_shortname
0 Michel Rappaz EPF
1 Hilal Lashuel EPF
2 Jeffrey Huang EPF
3 Pierre Magistretti EPF
4 Paolo Ienne EPF
... ... ... ...
546 Birgitt Borkopp UBE
547 Basile Cardinaux UFR
548 Marco Celio UFR
549 Anna Fontcuberta Morral EPF
550 Markus Hans Gross ETH

551 rows × 3 columns

Instructions 2/2

  • Insert all DISTINCT affiliations into affiliations from university_professors.
-- Insert unique affiliations into the new table
INSERT INTO affiliations 
SELECT DISTINCT firstname, lastname, function, organization 
FROM university_professors;

-- Doublecheck the contents of affiliations
SELECT * 
FROM affiliations;
In [21]:
stmt = "\
INSERT INTO uni.affiliations \
SELECT DISTINCT firstname, lastname, function, organization \
FROM uni.university_professors; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.affiliations \
"

pd.read_sql(stmt, conn)
Out[21]:
firstname lastname function organization
0 Dimos Poulikakos VR-Mandat Scrona AG
1 Francesco Stellacci Co-editor in Chief, Nanoscale Royal Chemistry Society, UK
2 Alexander Fust Fachexperte und Coach für Designer Startups Creative Hub
3 Jürgen Brugger Proposal reviewing HEPIA HES Campus Biotech, Genève
4 Hervé Bourlard Director Idiap Research Institute
... ... ... ... ...
1372 Ambrogio Fasoli Editor-in-Chief, Nuclear Fusion Journal IAEA, Vienna, Austria
1373 Pascal Pichonnaz Gutachter Selbstständig
1374 Colin Jones SNF SCOPES evaluation panel SNSF Bern
1375 Oliver Gassmann International Advisory Board Alexander von Humboldt Institut für Internet &...
1376 Roland Fankhauser Mitglied Liatowitsch & Partner

1377 rows × 4 columns

Migrate data to the universities and organizations tables*

In [22]:
stmt = "\
INSERT INTO uni.universities \
SELECT DISTINCT university_shortname, university, university_city \
FROM uni.university_professors; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.universities \
"

pd.read_sql(stmt, conn)
Out[22]:
university_shortname university university_city
0 UGE Uni Genf Geneva
1 USI USI Lugano Lugano
2 UFR Uni Freiburg Fribourg
3 USG Uni St. Gallen Saint Gallen
4 ULA Uni Lausanne Lausanne
5 EPF ETH Lausanne Lausanne
6 UBE Uni Bern Bern
7 ETH ETH Zürich Zurich
8 UNE Uni Neuenburg Neuchâtel
9 UBA Uni Basel Basel
10 UZH Uni Zürich Zurich
In [23]:
stmt = "\
INSERT INTO uni.organizations \
SELECT DISTINCT organization, organization_sector \
FROM uni.university_professors; \
"
conn.execute(stmt)

stmt = "\
SELECT *  \
FROM uni.organizations \
"

pd.read_sql(stmt, conn)
Out[23]:
organization organization_sector
0 Stiftung-Sammlung Schweizer Rechtsquellen Not classifiable
1 ResponsAbility Financial industry & insurances
2 Fondation IQRGC Not classifiable
3 u-blox AG Industry, construction & agriculture
4 Departement für Entwicklungshilfe und -zusamme... Politics, administration, justice system & sec...
... ... ...
1282 Rice University, Houston, Texas Education & research
1283 Rychiger AG Industry, construction & agriculture
1284 Fondazione Eccles Education & research
1285 NEBION AG Technology
1286 Schweizer Armee Politics, administration, justice system & sec...

1287 rows × 2 columns

Delete tables with DROP TABLE

The university_professors table is now no longer needed and can safely be deleted.

For table deletion, you can use the simple command:

DROP TABLE table_name;

Instructions

  • Delete the university_professors table.
-- Delete the university_professors table
DROP TABLE university_professors;

Enforce data consistency with attribute constraints

After building a simple database, it's now time to make use of the features. You'll specify data types in columns, enforce column uniqueness, and disallow NULL values in this chapter.

Better data quality with constraints

  • The idea of a database is to push data into a certain structure - a predefined model, where you enforce data types, relationships and other rules.
    • These rules are called integrity constraints, although different names exist.

Integrity Constrains

  1. Attribute constraints: data types on columns (Ch 2)
    • A certain attribute, represented through a database column, could have the integer data type, allowing only integers to be stored in the column.
  2. Key constraints: primary keys (Ch 3)
    • Primary keys, for example, uniquely identify each record, or row, of a database table.
  3. Referential integrity constraints: enforced through foreign keys (Ch 4)
    • They glue different database tables together.

Why should you know about constraints?

  • Constraints give the data structure
    • With good constraints in place, people who type in birthdates, for example, have to always enter them in the same form.
    • Data entered by humans is often very tedious to pre-process.
  • Constraints help with consistency, and thus improve data quality.
  • Data quality is a business advantage / data science prerequisite.
  • Enforcing constraints on human-entered data is difficult and tedious, but PostgreSQL helps.

Data types as attribute constaints

Name Aliases Description
bigint int8 signed eight-byte integer
bigserial serial8 autoincrementing eight-byte integer
bit [ (n) ] fixed-length bit string
bit varying [ (n) ] varbit [ (n) ] variable-length bit string
boolean bool logical Boolean (true/false)
box rectangular box on a plane
bytea binary data (“byte array”)
character [ (n) ] char [ (n) ] fixed-length character string
character varying [ (n) ] varchar [ (n) ] variable-length character string
cidr IPv4 or IPv6 network address
circle circle on a plane
date calendar date (year, month, day)
double precision float8 double precision floating-point number (8 bytes)
inet IPv4 or IPv6 host address
integer int, int4 signed four-byte integer
interval [ fields ] [ (p) ] time span
json textual JSON data
jsonb binary JSON data, decomposed
line infinite line on a plane
lseg line segment on a plane
macaddr MAC (Media Access Control) address
macaddr8 MAC (Media Access Control) address (EUI-64 format)
money currency amount
numeric [ (p, s) ] decimal [ (p, s) ] exact numeric of selectable precision
path geometric path on a plane
pg_lsn PostgreSQL Log Sequence Number
point geometric point on a plane
polygon closed geometric path on a plane
real float4 single precision floating-point number (4 bytes)
smallint int2 signed two-byte integer
smallserial serial2 autoincrementing two-byte integer
serial serial4 autoincrementing four-byte integer
text variable-length character string
time [ (p) ] [ without time zone ] time of day (no time zone)
time [ (p) ] with time zone timetz time of day, including time zone
timestamp [ (p) ] [ without time zone ] date and time (no time zone)
timestamp [ (p) ] with time zone timestamptz date and time, including time zone
tsquery text search query
tsvector text search document
txid_snapshot user-level transaction ID snapshot
uuid universally unique identifier
xml XML data
  • From PostgreSQL documentation
  • In its simplest form, attribute constraints are data types that can be specified for each column of a table.

**Dealing with data types (casting)

  • Data types also restrict possible SQL operations on the stored data. ```sql CREATE TABLE weather ( temperature integer, wind_speed text);

SELECT temperature * wind_speed AS wind_chill FROM weather;

operator does not exist: integer * text HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

SELECT temperature * CAST(wind_speed AS integer) AS wind_chill FROM weather; ```

  • For example, it's impossible to calculate a product from an integer and a text column, as shown here in the example.
    • The text column wind_speed may store numbers, but PostgreSQL doesn't know how to use text in a calculation.
    • The solution for this is type casts, that is, on-the-fly type conversions.
    • In this case, you can use the CAST function, followed by the column name, the AS keyword, and the desired data type. PostgreSQL will run wind_speed into an integer before the calculation.

Types of database constraints

Which of the following is not used to enforce a database constraint?

Possible Answers

  • Foreign keys
  • SQL aggregate functions
  • The BIGINT data type
  • Primary keys

Conforming with data types

For demonstration purposes, I created a fictional database table that only holds three records. The columns have the data types date, integer, and text, respectively.

CREATE TABLE transactions (
 transaction_date date, 
 amount integer,
 fee text
);
transaction_date amount fee
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
2018-09-24 5454 30
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
2018-09-24 5454 30
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35

Have a look at the contents of the transactions table.

The transaction_date accepts date values. According to the PostgreSQL documentation, it accepts values in the form of YYYY-MM-DD, DD/MM/YY, and so forth.

Both columns amount and fee appear to be numeric, however, the latter is modeled as text - which you will account for in the next exercise.

Instructions

  • Execute the given sample code.
  • As it doesn't work, have a look at the error message and correct the statement accordingly - then execute it again.
-- Let's add a record to the table
INSERT INTO transactions (transaction_date, amount, fee) 
VALUES ('2018-24-09', 5454, '30');

-- Doublecheck the contents
SELECT *
FROM transactions;

date/time field value out of range: "2018-24-09"
LINE 3: VALUES ('2018-24-09', 5454, '30');
                ^
HINT:  Perhaps you need a different "datestyle" setting.
-- Let's add a record to the table
INSERT INTO transactions (transaction_date, amount, fee) 
VALUES ('2018-09-24', 5454, '30');

-- Doublecheck the contents
SELECT *
FROM transactions;
transaction_date amount fee
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
2018-09-24 5454 30
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
2018-09-24 5454 30
1999-01-08 500 20
2001-02-20 403 15
2001-03-20 3430 35
2018-09-24 5454 30

Type CASTs

In the video, you saw that type casts are a possible solution for data type issues. If you know that a certain column stores numbers as text, you can cast the column to a numeric form, i.e. to integer.

SELECT CAST(some_column AS integer)
FROM table;

Now, the some_column column is temporarily represented as integer instead of text, meaning that you can perform numeric calculations on the column.

Instructions

  • Execute the given sample code.
  • As it doesn't work, add an integer type cast at the right place and execute it again.
-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + fee AS net_amount 
FROM transactions;

operator does not exist: integer + text
LINE 2: SELECT transaction_date, amount + fee AS net_amount 
                                        ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 
FROM transactions;
transaction_date net_amount
1999-01-08 520
2001-02-20 418
2001-03-20 3465
2018-09-24 5484

Working with data types

  • Enforced on columns (i.e. attributes)
  • Data types define the so-called domain of values in a column (what form the values can and cannot take).
  • Define what operations are possible.
  • Enforce consistent storage of values

THe most common data types in PostgreSQL

  • text: character strings of any length
  • varchar [ (x) ]: a maximum of n characters
  • char [ (x) ]: a fixed-length string of n characters
  • boolean: can only take three states, e.g. TRUE, FALSE and NULL (unknown) or "1" or "0"
  • date, time and timestamp: various formats for date and time calculations
  • numeric: arbitrary precision numbers, e.g. 3.1457
  • integer: whole numbers in the range of -2147483648 and +2147483647
    • If this range isn't large enough, there is bigint for larger numbers

Specifying type upon table creation

CREATE TABLE students (
 ssn integer, 
 name varchar(64), 
 dob date, 
 average_grade numeric(3, 2), -- e.g. 5.54
 tuition_paid boolean 
);

Alter types after table creation

ALTER TABLE students
ALTER COLUMN name 
TYPE varchar(128);
  • Sometimes it may be necessary to truncate column values or transform them in any other way, so they fit the new data type.
  • Then the USING keyword can be used to specify a transformation that should happen before the type is altered.
ALTER TABLE students
ALTER COLUMN average_grade
TYPE integer
-- Turns 5.54 into 6, not 5, before type conversion
USING ROUND(average_grade);
  • Normally, PostgreSQL would only keep the part of the number before the fractional point. With USING, have the number rounded to the nearest integer, for example.

Change types with ALTER COLUMN

The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name column in table_name to varchar(10):

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(10)

Now it's time to start adding constraints to your database.

Instructions 1/3

  • Have a look at the distinct university_shortname values in the professors table and take note of the length of the strings.
-- Select the university_shortname column
SELECT distinct(university_shortname) 
FROM professors;
In [24]:
stmt = "\
SELECT distinct(university_shortname) \
FROM uni.professors; \
"

pd.read_sql(stmt, conn)
Out[24]:
university_shortname
0 ULA
1 UNE
2 EPF
3 USG
4 UBA
5 UBE
6 UZH
7 UGE
8 UFR
9 USI
10 ETH

Instruction 2/3

  • Now specify a fixed-length character type with the correct length for university_shortname.
-- Specify the correct fixed-length character type
ALTER TABLE professors
ALTER COLUMN university_shortname
TYPE char(3);
In [25]:
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN university_shortname \
TYPE varchar(3); \
"
conn.execute(stmt)
Out[25]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e82a248>

Instructions 3/3

  • Change the type of the firstname column to varchar(64).
-- Change the type of firstname
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(64);
In [26]:
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN firstname \
TYPE varchar(64); \
"
conn.execute(stmt)
Out[26]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e827748>

Convert types USING a function

If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type.

For this, you can use the following syntax:

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(x)
USING SUBSTRING(column_name FROM 1 FOR x)

You should read it like this: Because you want to reserve only x characters for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters of it, and throw away the rest. This way, the values will fit the varchar(x) requirement.

Instructions

  • Run the sample code as is and take note of the error.
-- Convert the values in firstname to a max. of 16 characters
ALTER TABLE professors 
ALTER COLUMN firstname 
TYPE varchar(16)

value too long for type character varying(16)
  • Now use SUBSTRING() to reduce firstname to 16 characters so its type can be altered to varchar(16).
-- Convert the values in firstname to a max. of 16 characters
ALTER TABLE professors 
ALTER COLUMN firstname 
TYPE varchar(16)
USING SUBSTRING(firstname FROM 1 FOR 16)
  • Perfect! However, it's best not to truncate any values in your database, so we'll revert this column to varchar(64). Now it's time to move on to the next set of attribute constraints!

The not-null and unique constraints

not null constraint

  • The not-null constraint disallows NULL values in a column.
    • Applies to the current and future state of the the database
  • not-null can only be specified on a column that doesn't hold any NULL values
  • It will not be possible to insert NULL values in the future
  • What does NULL mean
    • the value is unknown
    • the value doesn't exist
    • the value does not apply to the column
CREATE TABLE students (
 ssn integer not null,
 lastname varchar(64) not null,
 home_phone integer,
 office_phone integer
);
  • An important take away is that two NULL values must not have the same meaning
  • This also means that comparing NULL with NULL always results in a FALSE value
    • not null constraints can be added to or removed from existing tables

Add not null to an existing table

ALTER TABLE students 
ALTER COLUMN home_phone 
SET NOT NULL;

Drop not null from an existing table

ALTER TABLE students 
ALTER COLUMN ssn 
DROP NOT NULL;

unique constraint

  • Disallows duplicate values in a column
  • Must hold true for the current state
  • Must hold true for any future state
CREATE TABLE table_name (
 column_name UNIQUE
);

Alter an existing table

ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);

Disallow NULL values with SET NOT NULL

The professors table is almost ready now. However, it still allows for NULLs to be entered. Although some information might be missing about some professors, there's certainly columns that always need to be specified.

Instructions 1/2

  • Add a not-null constraint for the firstname column.
-- Disallow NULL values in firstname
ALTER TABLE professors 
ALTER COLUMN firstname SET NOT NULL;
In [27]:
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN firstname SET NOT NULL; \
"
conn.execute(stmt)
Out[27]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0ea25788>

Instructions 2/2

  • Add a not-null constraint for the lastname column.
-- Disallow NULL values in lastname
ALTER TABLE professors 
ALTER COLUMN lastname SET NOT NULL;
In [28]:
stmt = "\
ALTER TABLE uni.professors \
ALTER COLUMN lastname SET NOT NULL; \
"
conn.execute(stmt)
Out[28]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e8253c8>

Good job - it is no longer possible to add professors which have either their first or last name set to NULL. Likewise, it is no longer possible to update an existing professor and setting their first or last name to NULL.

What happens if you try to enter NULLs?

Execute the following statement:

INSERT INTO professors (firstname, lastname, university_shortname)
VALUES (NULL, 'Miller', 'ETH');

Why does this throw an error?

Possible Answers

  • Professors without first names do not exist.
  • Because a database constraint is violated.
  • Error? This works just fine.
  • NULL is not put in quotes.

Make your columns UNIQUE with ADD CONSTRAINT

As seen in the video, you add the UNIQUE keyword after the column_name that should be unique. This, of course, only works for new tables:

CREATE TABLE table_name (
 column_name UNIQUE
);

If you want to add a unique constraint to an existing table, you do it like that:

ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);

Note that this is different from the ALTER COLUMN syntax for the not-null constraint. Also, you have to give the constraint a name some_name.

Instructions 1/2

  • Add a unique constraint to the university_shortname column in universities. Give it the name university_shortname_unq.
-- Make universities.university_shortname unique
ALTER TABLE universities
ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname);
In [29]:
stmt = "\
ALTER TABLE uni.universities \
ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname); \
"
conn.execute(stmt)
Out[29]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e8254c8>

Instructions 2/2

  • Add a unique constraint to the organization column in organizations. Give it the name organization_unq.
-- Make organizations.organization unique
ALTER TABLE organizations
ADD CONSTRAINT organization_unq UNIQUE(organization);
In [30]:
stmt = "\
ALTER TABLE uni.organizations \
ADD CONSTRAINT organization_unq UNIQUE(organization); \
"
conn.execute(stmt)
Out[30]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e8232c8>

Perfect. Making sure universities.university_shortname and organizations.organization only contain unique values is a prerequisite for turning them into so-called primary keys – the subject of the next chapter!

Uniquely identify records with key constraints

Now let's get into the best practices of database engineering. It's time to add primary and foreign keys to the tables. These are two of the most important concepts in databases, and are the building blocks you'll use to establish relationships between tables.

Keys and superkeys

  • The previous constraint changes, didn't change the structure of the database model
  • The database will look slightly different after changing key constraints
  • Primary keys, id, will be added to three different tables (organizations, professors and universities)
  • Keys are denoted by underlined attribute names

What is a key?

  • Attribute(s) that uniquely identify a record
  • Normally, a table, as a whole, only contains unique records, meaning the combination of all attributes is a key in itself.
  • As long as attributes from that combination can be removed and the attributes still uniquely identify records, it's called a superkey
  • If all possible attributes have been removed, but the records are still uniquely identifiable by the remaining attributes, we speak of a minimal superkey or key
    • A key is always minimal
| license_no         | serial_no | make       | model   | year |
|--------------------|-----------|------------|---------|------|
| Texas ABC-739      | A69352    | Ford       | Mustang | 2    |
| Florida TVP-347    | B43696    | Oldsmobile | Cutlass | 5    |
| New York MPO-22    | X83554    | Oldsmobile | Delta   | 1    |
| California 432-TFY | C43742    | Mercedes   | 190-D   | 99   |
| California RSK-629 | Y82935    | Toyota     | Camry   | 4    |
| Texas RSK-629      | U028365   | Jaguar     | XJS     | 4    |

Superkeys

  • SK1 = {license_no, serial_no, make, model, year}
  • SK2 = {license_no, serial_no, make, model}
  • SK3 = {make, model, year}
  • SK4 = {license_no, serial_no}
  • SKi, ..., SKn
  • The table shows six different cars, so the combination of all attributes is a superkey
  • If we remove the year attribute from the superkey, the six records are still unique, so it's still a superkey
  • There are many possible superkeys in the example
  • There are only 4 minimal superkeys

Candidate keys

  • K1 = {license_no}
  • K2 = {serial_no}
  • K3 = {model}
  • K4 = {make, year}
  • Remember, superkeys are minimal if no attributes can be removed without losing the uniqueness property.
    • The is trivial for K1 - K3, as they only consist of a single attribute.
    • If we remove year from K4, make would contain duplicates, and would, therefore, no longer be suited as a key
  • These four minimal superkeys are also called candidate keys
    • Only one candidate key can be the chosen key

Get to know SELECT COUNT DISTINCT

Your database doesn't have any defined keys so far, and you don't know which columns or combinations of columns are suited as keys.

There's a simple way of finding out whether a certain column (or a combination) contains only unique values - and thus identifies the records in the table.

You already know the SELECT DISTINCT query from the first chapter. Now you just have to wrap everything within the COUNT() function and PostgreSQL will return the number of unique rows for the given columns:

SELECT COUNT(DISTINCT(column_a, column_b, ...))
FROM table;

Instructions 1/2

  • First, find out the number of rows in universities.
-- Count the number of rows in universities
SELECT count(*) 
FROM universities;
In [31]:
stmt = "\
SELECT count(*) \
FROM uni.universities; \
"
res = conn.execute(stmt)
res.fetchall()
Out[31]:
[(11,)]

Instructions 2/2

  • Then, find out how many unique values there are in the university_city column.
-- Count the number of distinct values in the university_city column
SELECT COUNT(DISTINCT(university_city)) 
FROM universities;
In [32]:
stmt = "\
SELECT COUNT(DISTINCT(university_city)) \
FROM uni.universities; \
"
res = conn.execute(stmt)
res.fetchall()
Out[32]:
[(9,)]

The university_city column wouldn't lend itself as a key because there are only 9 distinct values, but the table has 11 rows.

Identify keys with SELECT COUNT DISTINCT

There's a very basic way of finding out what qualifies for a key in an existing, populated table:

  1. Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey.

  2. Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key.

The table professors has 551 rows. It has only one possible candidate key, which is a combination of two attributes. You might want to try different combinations using the "Run code" button. Once you have found the solution, you can submit your answer.

Instructions

  • Using the above steps, identify the candidate key by trying out different combination of columns.
In [33]:
attribs = ['firstname', 'lastname', 'university_shortname']
combos = sum([list(map(list, combinations(attribs, i))) for i in range(len(attribs) + 1)[1:]], [])
for x in combos:
    att_str = ', '.join(x)
    stmt = f"SELECT COUNT(DISTINCT({att_str})) FROM uni.professors;"
    res = conn.execute(stmt)
    print(att_str, res.fetchall())
firstname [(360,)]
lastname [(534,)]
university_shortname [(11,)]
firstname, lastname [(551,)]
firstname, university_shortname [(479,)]
lastname, university_shortname [(546,)]
firstname, lastname, university_shortname [(551,)]

The only combination that uniquely identifies professors is {firstname, lastname}. {firstname, lastname, university_shortname} is a superkey, and all other combinations give duplicate values. Hopefully, the concept of superkeys and keys is now a bit more clear. Let's move on to primary keys!

Primary keys

  • Primary keys are one of the most important concepts in database design.
  • Almost every database table whold ahve a primary key - chosen from the set of candidate keys
  • The main purpose is uniquely identifying records in a table
    • This makes it easier to reference these records from other tables.
  • Primary keys need to be defined on columns that don't accept duplicate or null values
  • Primary key constraints are time-invariant, meaning they must hold for the current data in the table and for any future data the might be added to the table.
  • Choose columns where values will always be unique and not null.

Specifying primary keys

  • The following two tables accept exactly the same data, however, the latter has an explicit primary key specified.
CREATE TABLE products (
    product_no integer UNIQUE NOT NULL,
    name text,
    price numeric
);

CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);
  • Primary keys can also be specified as follows
    • This notation is necessary if you want to designate more than one column as the primary keys.
    • This is still only one primary key, it's just formed by the combination of two columns.
    • Ideally, primary keys consist of as few columns as possible.
CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);
  • Adding primary key constraints to existing tables is the same procedure as adding unique constraints.
  • As with unique constraints, the constraint must be given a certain name.
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name)
  • In the exercises that follow, add primary keys to the tables universities and organizations.
  • Add a special type of primary key, a surrogate key, to the table professors in the last part of this chapter.

database

Identify the primary key

Have a look at the example table from the previous video. As the database designer, you have to make a wise choice as to which column should be the primary key.

     license_no     | serial_no |    make    |  model  | year
--------------------+-----------+------------+---------+------
 Texas ABC-739      | A69352    | Ford       | Mustang |    2
 Florida TVP-347    | B43696    | Oldsmobile | Cutlass |    5
 New York MPO-22    | X83554    | Oldsmobile | Delta   |    1
 California 432-TFY | C43742    | Mercedes   | 190-D   |   99
 California RSK-629 | Y82935    | Toyota     | Camry   |    4
 Texas RSK-629      | U028365   | Jaguar     | XJS     |    4

Which of the following column or column combinations could best serve as primary key?

Possible Answers

  • PK = {make}
  • PK = {model, year}
  • PK = {license_no}
  • PK = {year, make}
  • A primary key consisting solely of license_no is probably the wisest choice, as license numbers are certainly unique across all registered cars in a country.

ADD key CONSTRAINTs to the tables

Two of the tables in your database already have well-suited candidate keys consisting of one column each: organizations and universities with the organization and university_shortname columns, respectively.

In this exercise, you'll rename these columns to id using the RENAME COLUMN command and then specify primary key constraints for them. This is as straightforward as adding unique constraints (see the last exercise of Chapter 2):

ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name)

Note that you can also specify more than one column in the brackets.

Instructions 1/2

  • Rename the organization column to id in organizations.
  • Make id a primary key and name it organization_pk.
-- Rename the organization column to id
ALTER TABLE organizations
RENAME COLUMN organization TO id;

-- Make id a primary key
ALTER TABLE organizations
ADD CONSTRAINT organization_pk PRIMARY KEY (id);
In [34]:
stmt = "\
ALTER TABLE uni.organizations \
RENAME COLUMN organization TO id; \
"
conn.execute(stmt)

stmt = "\
ALTER TABLE uni.organizations \
ADD CONSTRAINT organization_pk PRIMARY KEY (id); \
"
conn.execute(stmt)
Out[34]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e817108>

Instructions 2/2

  • Rename the university_shortname column to id in universities.
  • Make id a primary key and name it university_pk.
-- Rename the university_shortname column to id
ALTER TABLE universities
RENAME COLUMN university_shortname TO id;

-- Make id a primary key
ALTER TABLE universities
ADD CONSTRAINT university_pk PRIMARY KEY (id);
In [35]:
stmt = "\
ALTER TABLE uni.universities \
RENAME COLUMN university_shortname TO id; \
"
conn.execute(stmt)

stmt = "\
ALTER TABLE uni.universities \
ADD CONSTRAINT university_pk PRIMARY KEY (id); \
"
conn.execute(stmt)
Out[35]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e81d988>

Surrogate keys

  • Surrogate keys are sort of an artificial primary key.
    • They aren't based on a native column in the data, but on a column that just exist for the sake of having a primary key
  • There are several reason for creating an artificial surrogate key.
    • A primary key is ideally constructed from as few columns as possible.
    • The primary key of a record should never change over time.
  • If you define an artificial primary key, ideally consisting of a unique number or string, you can be sure that this number stays the same for each record.
  • Other attributes might change, but the primary key always has the same value for a given record.
| license_no         | serial_no | make       | model   | color     |
|--------------------|-----------|------------|---------|-----------|
| Texas ABC-739      | A69352    | Ford       | Mustang | blue      |
| Florida TVP-347    | B43696    | Oldsmobile | Cutlass | black     |
| New York MPO-22    | X83554    | Oldsmobile | Delta   | silver    |
| California 432-TFY | C43742    | Mercedes   | 190-D   | champagne |
| California RSK-629 | Y82935    | Toyota     | Camry   | red       |
| Texas RSK-629      | U028365   | Jaguar     | XJS     | blue      |
  • In this table, the license_no column would be suited as the primary key - the license number is unlikely to change over time, not like the color column, for example, which might change if the car is repainted.
  • There's no need for a surrogate key here.
  • However, lets say there were only the following three attributes in the table.
| make       | model   | color     |
|------------|---------|-----------|
| Ford       | Mustang | blue      |
| Oldsmobile | Cutlass | black     |
| Oldsmobile | Delta   | silver    |
| Mercedes   | 190-D   | champagne |
| Toyota     | Camry   | red       |
| Jaguar     | XJS     | blue      |
  • The only sensible primary key would be the combination of make and model, but that's two columns for the primary key.
  • You could add a new surrogate key column, called id, to solve this problem.
  • Actually there's a special data type in PostgreSQL that allows the addition of auto-incrementing numbers to an existing table: the serial type.
  • It's specified just like any other data type.
  • Once you add a column with the serial type, all the records in your table will be numbered.
ALTER TABLE cars
ADD COLUMN id serial PRIMARY KEY;
INSERT INTO cars
VALUES ('Volkswagen', 'Blitz', 'black');
| make       | model   | color     | id |
|------------|---------|-----------|----|
| Ford       | Mustang | blue      | 1  |
| Oldsmobile | Cutlass | black     | 2  |
| Oldsmobile | Delta   | silver    | 3  |
| Mercedes   | 190-D   | champagne | 4  |
| Toyota     | Camry   | red       | 5  |
| Jaguar     | XJS     | blue      | 6  |
| Volkswaven | Blitz   | black     | 7  |
  • Whenever you add a new record to the table, it will automatically get a number that does not exist yet.
INSERT INTO cars
VALUES ('Opel', 'Astra', 'green', 1);

_duplicate key value violates unique constraint "id_pkey"DETAIL: Key (id)=(1) already exists._

  • If you try to specify an ID that already exists, the primary key constraint will prevent you from doing so.
  • After all, the id column uniquely identifies each record in this table, which is very useful, for example, when you want to refer to these records from another table.

Another type of surrogate key

  • Combine two existing columns into a new one
ALTER TABLE table_name
ADD COLUMN column_c varchar(256);

UPDATE table_name
SET column_c = CONCAT(column_a, column_b);

ALTER TABLE table_name
ADD CONSTRAINT pk PRIMARY KEY (column_c);
  1. Add a new column with the varchar data type
  2. UPDATE that column with the concatenation of two existing columns
    • The CONCAT function glues together the values of two or more existing columns.
  3. Turn the new column into a surrogate primary key.
  • In the exercises, add a surrogate key to the professors table, because the existing attributes are not really suited as primary keys.
  • There could be more than one professor with the same name working for one university, resulting in duplicates.
  • With an auto-incrementing id column as the primary key, you make sure that each professor can be uniquely referred to.
  • This was not necessary for organizations and universities, as their names can be assumed to be unique across these tables.
    • It is unlikely that two organizations with the same name exist, solely for trademark reasons; the same for universities.

Add a SERIAL surrogate key

Since there's no single column candidate key in professors (only a composite key candidate consisting of firstname, lastname), you'll add a new column id to that table.

This column has a special data type serial, which turns the column into an auto-incrementing number. This means that, whenever you add a new professor to the table, it will automatically get an id that does not exist yet in the table: a perfect primary key!

Instructions 1/3

  • Add a new column id with data type serial to the professors table.
-- Add the new column to the table
ALTER TABLE professors 
ADD COLUMN id serial;
In [36]:
stmt = "\
ALTER TABLE uni.professors \
ADD COLUMN id serial; \
"
conn.execute(stmt)
Out[36]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e81d588>

Instructions 2/3

Make id a primary key and name it professors_pkey.

-- Make id a primary key
ALTER TABLE professors
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);
In [37]:
stmt = "\
ALTER TABLE uni.professors \
ADD CONSTRAINT professors_pkey PRIMARY KEY (id); \
"
conn.execute(stmt)
Out[37]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e825108>

Instructions 3/3

Write a query that returns all the columns and 10 rows from professors.

SELECT * FROM professors
LIMIT 10;
In [38]:
stmt = "\
SELECT * FROM uni.professors \
LIMIT 10; \
"

res = conn.execute(stmt)
res.fetchall()
Out[38]:
[('Michel', 'Rappaz', 'EPF', 1),
 ('Hilal', 'Lashuel', 'EPF', 2),
 ('Jeffrey', 'Huang', 'EPF', 3),
 ('Pierre', 'Magistretti', 'EPF', 4),
 ('Paolo', 'Ienne', 'EPF', 5),
 ('Frédéric', 'Kaplan', 'EPF', 6),
 ('Olivier', 'Hari', 'UNE', 7),
 ('Christian', 'Hesse', 'UBE', 8),
 ('Majed', 'Chergui', 'EPF', 9),
 ('Douglas', 'Hanahan', 'EPF', 10)]

As you can see, PostgreSQL has automatically numbered the rows with the id column, which now functions as a (surrogate) primary key – it uniquely identifies professors.

CONCATenate columns to a surrogate key

Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT() function.

Let's think of the following example table:

CREATE TABLE cars (
 make varchar(64) NOT NULL,
 model varchar(64) NOT NULL,
 mpg integer NOT NULL
)

The table is populated with 10 rows of completely fictional data.

Unfortunately, the table doesn't have a primary key yet. None of the columns consists of only unique values, so some columns can be combined to form a key.

In the course of the following exercises, you will combine make and model into such a surrogate key.

Instructions 1/4

  • Count the number of distinct rows with a combination of the make and model columns.
-- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model))
FROM cars;
In [39]:
# Create a new schema and table
conn.execute('CREATE SCHEMA auto;')
conn.execute(' \
CREATE TABLE auto.cars ( \
 "make" varchar(64) NOT NULL, \
 "model" varchar(64) NOT NULL, \
 "mpg" integer NOT NULL \
);')
Out[39]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e827688>
In [40]:
data = {'make': ['Subaru', 'Opel', 'Opel', 'Ford', 'Ford', 'Toyota', 'Toyota', 'Toyota', 'Mitsubishi', 'Mitsubishi'],
        'model': ['Forester', 'Astra', 'Vectra', 'Avenger', 'Galaxy', 'Prius', 'Speedster', 'Galaxy', 'Forester', 'Galaxy'],
        'mpg': [24, 45, 40, 30, 30, 50, 30, 20, 10, 30]}
cars = pd.DataFrame(data)
In [41]:
cars.to_sql('cars', con=engine, schema='auto', index=False, if_exists='replace')
In [42]:
stmt = "\
SELECT COUNT(DISTINCT(make, model)) \
FROM auto.cars; \
"

res = conn.execute(stmt)
res.fetchall()
Out[42]:
[(10,)]

Instructions 2/4

Add a new column id with the data type varchar(128).

-- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);
In [43]:
stmt = "\
ALTER TABLE auto.cars \
ADD COLUMN id varchar(128); \
"

conn.execute(stmt)
Out[43]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0b001588>

Instructions 3/4

Concatenate make and model into id using an UPDATE table_name SET column_name = ... query and the CONCAT() function.

-- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);
In [44]:
stmt = "\
UPDATE auto.cars \
SET id = CONCAT(make, model); \
"

conn.execute(stmt)
Out[44]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e825d08>

Instructions 4/4

Make id a primary key and name it id_pk.

-- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);

-- Have a look at the table
SELECT * FROM cars;
In [45]:
stmt = "\
ALTER TABLE auto.cars \
ADD CONSTRAINT id_pk PRIMARY KEY(id); \
"
conn.execute(stmt)

res = conn.execute("SELECT * FROM auto.cars")
res.fetchall()
Out[45]:
[('Subaru', 'Forester', 24, 'SubaruForester'),
 ('Opel', 'Astra', 45, 'OpelAstra'),
 ('Opel', 'Vectra', 40, 'OpelVectra'),
 ('Ford', 'Avenger', 30, 'FordAvenger'),
 ('Ford', 'Galaxy', 30, 'FordGalaxy'),
 ('Toyota', 'Prius', 50, 'ToyotaPrius'),
 ('Toyota', 'Speedster', 30, 'ToyotaSpeedster'),
 ('Toyota', 'Galaxy', 20, 'ToyotaGalaxy'),
 ('Mitsubishi', 'Forester', 10, 'MitsubishiForester'),
 ('Mitsubishi', 'Galaxy', 30, 'MitsubishiGalaxy')]

Test your knowledge before advancing

Before you move on to the next chapter, let's quickly review what you've learned so far about attributes and key constraints. If you're unsure about the answer, please quickly review chapters 2 and 3, respectively.

Let's think of an entity type "student". A student has:

  • a last name consisting of up to 128 characters (this cannot contain a missing value),
  • a unique social security number of length 9, consisting only of integers,
  • a phone number of fixed length 12, consisting of numbers and characters (but some students don't have one).

Instructions

  • Given the above description of a student entity, create a table students with the correct column types.
  • Add a PRIMARY KEY for the social security number ssn.
  • Note that there is no formal length requirement for the integer column. The application would have to make sure it's a correct SSN!
-- Create the table
CREATE TABLE students (
  last_name char(128) NOT NULL,
  ssn integer[9] UNIQUE,
  phone_no varchar(12),
  PRIMARY KEY (ssn)
);
In [46]:
conn.execute(' \
CREATE TABLE uni.students ( \
 "last_name" char(128) NOT NULL, \
 "ssn" integer[9] UNIQUE, \
 "phone_no" varchar(12), \
 PRIMARY KEY (ssn) \
);')
Out[46]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e817fc8>

Glue together tables with foreign keys

In the final chapter, you'll leverage foreign keys to connect tables and establish relationships that will greatly benefit your data quality. And you'll run ad hoc analyses on your new database.

Model 1:N relationships with foreign keys

Current DB model

db_model

  • The three entity types professors, organizations, and universities all have primary keys, but affiliations doesn't, for a specific reason that will be revealed in this chapter.

Next DB model

relationship

  • You'll model a so-called relationship type between professors and universities.
  • In your database, each professor work for a university.
    • In the ER diagram, this is drawn with a rhombus.
    • The small numbers specify the cardinality of the relationship: a professor works for at most one university, while a university can have any number of professors working for it - even zero.
  • Such relationships are implemented with foreign keys.
    • Foreign keys are designated columns that point to a primary key of another table. A FK points to the PK of another table.
    • There are some restrictions for foreign keys.
    1. The domain and the data type must be the same as the one of the primary key. Domain of FK and PK must be equal.
    2. Only foreign key values are allowed that exist as values in the primary key of the referenced table. Each value of FK must exist in PK of the other table (FK constraint or referential integrity).
    • This is the actual foreign key constraint, also called referential integrity.
  • FKs are not actual keys
    • A foreign key is not necessarily an actual key, because duplicates and NULL values are allowed.
SELECT * FROM professors LIMIT 8;
| id | firstname       | lastname    | university_shortname |
|----|-----------------|-------------|----------------------|
| 1  | Karl            | Aberer      | EPF                  |
| 2  | Reza Shokrollah | Abhari      | ETH                  |
| 3  | Georges         | Abou Jaoudé | EPF                  |
| 4  | Hugues          | Abriel      | UBE                  |
| 5  | Daniel          | Aebersold   | UBE                  |
| 6  | Marcelo         | Aebi        | ULA                  |
| 7  | Christoph       | Aebi        | UBE                  |
| 8  | Patrick         | Aebischer   | EPF                  |
SELECT * FROM universities;
| id  | university     | university_city |
|-----|----------------|-----------------|
| EPF | ETH Lausanne   | Lausanne        |
| ETH | ETH Zürich     | Zurich          |
| UBA | Uni Basel      | Basel           |
| UBE | Uni Bern       | Bern            |
| UFR | Uni Freiburg   | Fribourg        |
| UGE | Uni Genf       | Geneva          |
| ULA | Uni Lausanne   | Lausanne        |
| UNE | Uni Neuenburg  | Neuchâtel       |
| USG | Uni St. Gallen | Saint Gallen    |
| USI | USI Lugano     | Lugano          |
| UZH | Uni Zürich     | Zurich          |
  • The column university_shortname of professors has the same domain as the id column of the universities table
  • If you go through each record of professors, you can always find the respective id in the universities table.
  • So both criteria for a foreign key in the talbe professor referencing universities are fulfilled.
  • university_shortname is not really a key because there are duplicates.
    • EPF and UBE occur three times each.

Specifying foreign keys

CREATE TABLE manufacturers (
 name varchar(255) PRIMARY KEY
);

INSERT INTO manufacturers 
VALUES ('Ford'), ('VW'), ('GM');

CREATE TABLE cars (
 model varchar(255) PRIMARY KEY,
 manufacturer_name integer REFERENCES manufacturers (name)
);

INSERT INTO cars 
VALUES ('Ranger', 'Ford'), ('Beetle', 'VW');
  • When you create a new table, you can specify a foreign key similarly to a primary key
  1. Create a manufacturers table with a primary key called name.
  2. Create a table cars, that also has a primary key, called model.
  • As each car is produced by a certain manufacturer, it makes sense to also add a foreign key to this table.
    • We do that by writing the REFERENCES keyword, followed by the referenced table and its primary key in brackets.
  • From now on, only cars with valid and existing manufacturers may be entered into that table.
-- Throws an error!
INSERT INTO cars 
VALUES ('Tundra', 'Toyota');
  • Trying to enter models with manufacturers that are not yet stored in the manufacturers table won't be possible, thanks to the foreign key constraint.

Specifying foreign keys to existing tables

ALTER TABLE a 
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);
  • The syntax for adding foreign keys to existing tables is the same as the one for adding primary keys and unique constraints.

REFERENCE a table with a FOREIGN KEY

In your database, you want the professors table to reference the universities table. You can do that by specifying a column in professors table that references a column in the universities table.

As just shown in the video, the syntax for that looks like this:

ALTER TABLE a 
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);

Table a should now refer to table b, via b_id, which points to id. a_fkey is, as usual, a constraint name you can choose on your own.

Pay attention to the naming convention employed here: Usually, a foreign key referencing another primary key with name id is named x_id, where x is the name of the referencing table in the singular form.

Instructions 1/2

  • Rename the university_shortname column to university_id in professors.
-- Rename the university_shortname column
ALTER TABLE professors
RENAME COLUMN university_shortname to university_id;
In [47]:
stmt = "\
ALTER TABLE uni.professors \
RENAME COLUMN university_shortname to university_id; \
"
conn.execute(stmt)
Out[47]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0ea832c8>

Instructions 2/2

  • Add a foreign key on university_id column in professors that references the id column in universities.
  • Name this foreign key professors_fkey.
-- Add a foreign key on professors referencing universities
ALTER TABLE professors 
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id);
In [48]:
stmt = "\
ALTER TABLE uni.professors \
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES uni.universities (id); \
"
conn.execute(stmt)
Out[48]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0ea833c8>

Now, the professors table has a link to the universities table. Each professor belongs to exactly one university.

Explore foreign key constraints

Foreign key constraints help you to keep order in your database mini-world. In your database, for instance, only professors belonging to Swiss universities should be allowed, as only Swiss universities are part of the universities table.

The foreign key on professors referencing universities you just created thus makes sure that only existing universities can be specified when inserting new data. Let's test this!

Instructions

  • Run the sample code and have a look at the error message.
-- Try to insert a new professor
INSERT INTO professors (firstname, lastname, university_id)
VALUES ('Albert', 'Einstein', 'MIT');

insert or update on table "professors" violates foreign key constraint "professors_fkey"

DETAIL: Key (university_id)=(MIT) is not present in table "universities".

  • What's wrong? Correct the university_id so that it actually reflects where Albert Einstein wrote his dissertation and became a professor – at the University of Zurich (UZH)!
-- Try to insert a new professor
INSERT INTO professors (firstname, lastname, university_id)
VALUES ('Albert', 'Einstein', 'UZH');
In [49]:
stmt = "\
INSERT INTO uni.professors (firstname, lastname, university_id) \
VALUES ('Albert', 'Einstein', 'MIT'); \
"
conn.execute(stmt)
---------------------------------------------------------------------------
ForeignKeyViolation                       Traceback (most recent call last)
e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1245                     self.dialect.do_execute(
-> 1246                         cursor, statement, parameters, context
   1247                     )

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py in do_execute(self, cursor, statement, parameters, context)
    587     def do_execute(self, cursor, statement, parameters, context=None):
--> 588         cursor.execute(statement, parameters)
    589 

ForeignKeyViolation: insert or update on table "professors" violates foreign key constraint "professors_fkey"
DETAIL:  Key (university_id)=(MIT) is not present in table "universities".


The above exception was the direct cause of the following exception:

IntegrityError                            Traceback (most recent call last)
<ipython-input-49-23fceda85b9e> in <module>
      3 VALUES ('Albert', 'Einstein', 'MIT'); \
      4 "
----> 5 conn.execute(stmt)

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in execute(self, object_, *multiparams, **params)
    974         """
    975         if isinstance(object_, util.string_types[0]):
--> 976             return self._execute_text(object_, multiparams, params)
    977         try:
    978             meth = object_._execute_on_connection

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _execute_text(self, statement, multiparams, params)
   1147             parameters,
   1148             statement,
-> 1149             parameters,
   1150         )
   1151         if self._has_events or self.engine._has_events:

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1248         except BaseException as e:
   1249             self._handle_dbapi_exception(
-> 1250                 e, statement, parameters, cursor, context
   1251             )
   1252 

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1474                 util.raise_from_cause(newraise, exc_info)
   1475             elif should_wrap:
-> 1476                 util.raise_from_cause(sqlalchemy_exception, exc_info)
   1477             else:
   1478                 util.reraise(*exc_info)

e:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in raise_from_cause(exception, exc_info)
    396     exc_type, exc_value, exc_tb = exc_info
    397     cause = exc_value if exc_value is not exception else None
--> 398     reraise(type(exception), exception, tb=exc_tb, cause=cause)
    399 
    400 

e:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in reraise(tp, value, tb, cause)
    150             value.__cause__ = cause
    151         if value.__traceback__ is not tb:
--> 152             raise value.with_traceback(tb)
    153         raise value
    154 

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1244                 if not evt_handled:
   1245                     self.dialect.do_execute(
-> 1246                         cursor, statement, parameters, context
   1247                     )
   1248         except BaseException as e:

e:\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py in do_execute(self, cursor, statement, parameters, context)
    586 
    587     def do_execute(self, cursor, statement, parameters, context=None):
--> 588         cursor.execute(statement, parameters)
    589 
    590     def do_execute_no_params(self, cursor, statement, context=None):

IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "professors" violates foreign key constraint "professors_fkey"
DETAIL:  Key (university_id)=(MIT) is not present in table "universities".

[SQL: INSERT INTO uni.professors (firstname, lastname, university_id) VALUES ('Albert', 'Einstein', 'MIT'); ]
(Background on this error at: http://sqlalche.me/e/gkpj)
In [50]:
stmt = "\
INSERT INTO uni.professors (firstname, lastname, university_id) \
VALUES ('Albert', 'Einstein', 'UZH'); \
"
conn.execute(stmt)
Out[50]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e81b508>

As you can see, inserting a professor with non-existing university IDs violates the foreign key constraint you've just added. This also makes sure that all universities are spelled equally – adding to data consistency.

JOIN tables linked by a foreign key

Let's join these two tables to analyze the data further!

You might already know how SQL joins work from the Intro to SQL for Data Science course (last exercise) or from Joining Data in PostgreSQL.

Here's a quick recap on how joins generally work:

SELECT ...
FROM table_a
JOIN table_b
ON ...
WHERE ...

While foreign keys and primary keys are not strictly necessary for join queries, they greatly help by telling you what to expect. For instance, you can be sure that records referenced from table A will always be present in table B – so a join from table A will always find something in table B. If not, the foreign key constraint would be violated.

Instructions

  • JOIN professors with universities on professors.university_id = universities.id, i.e., retain all records where the foreign key of professors is equal to the primary key of universities.
  • Filter for university_city = 'Zurich'.
-- Select all professors working for universities in the city of Zurich
SELECT professors.lastname, universities.id, universities.university_city
FROM professors
JOIN universities
ON professors.university_id = universities.id
WHERE universities.university_city = 'Zurich';
In [51]:
stmt = "\
SELECT professors.lastname, universities.id, universities.university_city \
FROM uni.professors \
JOIN uni.universities \
ON professors.university_id = universities.id \
WHERE universities.university_city = 'Zurich'; \
"

res = conn.execute(stmt)
res.fetchall()
Out[51]:
[('Grote', 'ETH', 'Zurich'),
 ('Carreira', 'ETH', 'Zurich'),
 ('Wennemers', 'ETH', 'Zurich'),
 ('Hafen', 'ETH', 'Zurich'),
 ('Saar', 'ETH', 'Zurich'),
 ('Loeliger', 'ETH', 'Zurich'),
 ('Schönsleben', 'ETH', 'Zurich'),
 ('Bühler', 'UZH', 'Zurich'),
 ('Hellweg', 'ETH', 'Zurich'),
 ('Gugerli', 'UZH', 'Zurich'),
 ('Maurer', 'ETH', 'Zurich'),
 ('Axhausen', 'ETH', 'Zurich'),
 ('Schulin', 'ETH', 'Zurich'),
 ('Steinfeld', 'ETH', 'Zurich'),
 ('Morbidelli', 'ETH', 'Zurich'),
 ('Levine', 'ETH', 'Zurich'),
 ('Stalder', 'ETH', 'Zurich'),
 ('Hungerbühler', 'ETH', 'Zurich'),
 ('Diederich', 'ETH', 'Zurich'),
 ('Abhari', 'ETH', 'Zurich'),
 ('Kohler', 'ETH', 'Zurich'),
 ('Faist', 'ETH', 'Zurich'),
 ('Gramazio', 'ETH', 'Zurich'),
 ('Widmer', 'ETH', 'Zurich'),
 ('Sornette', 'ETH', 'Zurich'),
 ('Guzzella', 'ETH', 'Zurich'),
 ('Rudolf von Rohr', 'ETH', 'Zurich'),
 ('Krek', 'ETH', 'Zurich'),
 ('Vogel', 'ETH', 'Zurich'),
 ('Tröster', 'ETH', 'Zurich'),
 ('Eberle', 'ETH', 'Zurich'),
 ('Weidmann', 'ETH', 'Zurich'),
 ('Wolfrum', 'ETH', 'Zurich'),
 ('Günther', 'ETH', 'Zurich'),
 ('Lanzavecchia', 'ETH', 'Zurich'),
 ('Ermanni', 'ETH', 'Zurich'),
 ('Meboldt', 'ETH', 'Zurich'),
 ('Bechtold', 'ETH', 'Zurich'),
 ('Boutellier', 'ETH', 'Zurich'),
 ('Wegener', 'ETH', 'Zurich'),
 ('Kaufmann', 'ETH', 'Zurich'),
 ('Fontana', 'ETH', 'Zurich'),
 ('Menz', 'ETH', 'Zurich'),
 ('Schlüter', 'ETH', 'Zurich'),
 ('Székely', 'ETH', 'Zurich'),
 ('Widmayer', 'ETH', 'Zurich'),
 ('von Krogh', 'ETH', 'Zurich'),
 ('Morari', 'ETH', 'Zurich'),
 ('Robertsson', 'ETH', 'Zurich'),
 ('Baschera', 'ETH', 'Zurich'),
 ('Szucs', 'UZH', 'Zurich'),
 ('Nelson', 'ETH', 'Zurich'),
 ("D'Andrea", 'ETH', 'Zurich'),
 ('Heinrich', 'ETH', 'Zurich'),
 ('Gademann', 'UZH', 'Zurich'),
 ('Poulikakos', 'ETH', 'Zurich'),
 ('Jenny', 'ETH', 'Zurich'),
 ('Capkun', 'ETH', 'Zurich'),
 ('Klumpner', 'ETH', 'Zurich'),
 ('Meili', 'ETH', 'Zurich'),
 ('Gruber', 'ETH', 'Zurich'),
 ('Jackowski', 'UZH', 'Zurich'),
 ('Gruissem', 'ETH', 'Zurich'),
 ('Holdenrieder', 'ETH', 'Zurich'),
 ('Siegwart', 'ETH', 'Zurich'),
 ('Schubert', 'ETH', 'Zurich'),
 ('Neri', 'ETH', 'Zurich'),
 ('Wüest', 'ETH', 'Zurich'),
 ('Frossard', 'ETH', 'Zurich'),
 ('Chen', 'ETH', 'Zurich'),
 ('Basin', 'ETH', 'Zurich'),
 ('Gross', 'ETH', 'Zurich'),
 ('Einstein', 'UZH', 'Zurich')]

First, the university belonging to each professor was attached with the JOIN operation. Then, only professors having "Zurich" as university city were retained with the WHERE clause.

Model more complex relationships

  • In the last few exercises, you made your first steps in modeling and implementing 1:N-relationships between professors and universities.
  • Lets look at more complex relationships.
  • 1:N-relationships are implemented with one foreign key in the table that has at most one foreign entity associated.
  • In this case, that's the professors table, as professors cannot have more than one university associated.
  • What about affiliations?
    • We know that a professor can have more than one affiliation with organizations, for instance, as a chairman of a bank and as a president of a golf club.
    • On the other hand, organizations can also have more than one professor connected with them
    • Let's look at the entity-relationship diagram that models this.

final_model

  • There are a couple of things that are new.
  1. A new relationship between organizations and professors was added.
    • This is an N:M relationship
    • This depicts the fact that a professor can be affiliated with more than one organization and vice versa.
    • Also, it has an own attribute, the function.
    • Remember that each affiliation comes with a function, for instance, chariman.
  2. The affiliations entity type disappeared altogether (shadowed in the diagram).
    • You'll still have four tables: three for the entities professors, universities, and organizations, and one for the N:M relationship between professors and organizations.

How to implement N:M-relationships

  • Create a table. Such a relationship is implemented with an ordinary database table that contains two foreign keys that point to both connected entities.
  • Add foreign keys for every connected table. In this case, that's a foreign key pointing to the professors.id column and one pointing to the organizations.id column.
  • Add additional attributes. In this case, function.
CREATE TABLE affiliations (
 professor_id integer REFERENCES professors (id),
 organization_id varchar(256) REFERENCES organization (id),
 function varchar(256)
);
  • If you were to create that relationship table from scratch, you would define it as shown.
  • The professor_id is stored as integer, as the primary key it refers to has the type serial, which is also an integer.
  • organization_id has varchar(256) as type, conforming to the primary key in the organizations table.
  • Notice that no primary key is define here because a professor can theoretically have multiple functions in one organization.
  • Possible PK = {professor_id, organization_id, function}
    • One could define the combination of all three attributes as the primary key in order to have some form of unique constraint in that table, but that would be a bit over the top.

Time to implement this

  • Since you already have a pre-populated affiliations table, things are not going to be so straightforward.
  • You'll need to link and migrate the data to a new table to implement this relationship.

Add foreign keys to the "affiliations" table

At the moment, the affiliations table has the structure {firstname, lastname, function, organization}, as you can see in the preview at the bottom right. In the next three exercises, you're going to turn this table into the form {professor_id, organization_id, function}, with professor_id and organization_id being foreign keys that point to the respective tables.

You're going to transform the affiliations table in-place, i.e., without creating a temporary table to cache your intermediate results.

Instructions 1/3

  • Add a professor_id column with integer data type to affiliations, and declare it to be a foreign key that references the id column in professors.
-- Add a professor_id column
ALTER TABLE affiliations
ADD COLUMN professor_id integer REFERENCES professors (id);
In [52]:
stmt = "\
ALTER TABLE uni.affiliations \
ADD COLUMN professor_id integer REFERENCES uni.professors (id); \
"
conn.execute(stmt)
Out[52]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0eb78e08>

Instructions 2/3

Rename the organization column in affiliations to organization_id.

-- Rename the organization column to organization_id
ALTER TABLE affiliations
RENAME COLUMN organization TO organization_id;
In [53]:
stmt = "\
ALTER TABLE uni.affiliations \
RENAME COLUMN organization TO organization_id; \
"
conn.execute(stmt)
Out[53]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e9e2688>

Instructions 3/3

Add a foreign key constraint on organization_id so that it references the id column in organizations.

-- Add a foreign key on organization_id
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id);
In [54]:
stmt = "\
ALTER TABLE uni.affiliations \
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES uni.organizations (id); \
"
conn.execute(stmt)
Out[54]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0e9e2608>

Making organization_id a foreign key worked flawlessly because these organizations actually exist in the organizations table. That was only the first part, though. Now it's time to update professor_id in affiliations – so that it correctly refers to the corresponding professors.

Populate the "professor_id" column

Now it's time to also populate professors_id. You'll take the ID directly from professors.

Here's a way to update columns of a table based on values in another table:

UPDATE table_a
SET column_to_update = table_b.column_to_update_from
FROM table_b
WHERE condition1 AND condition2 AND ...;

This query does the following:

  1. For each row in table_a, find the corresponding row in table_b where condition1, condition2, etc., are met.
  2. Set the value of column_to_update to the value of column_to_update_from (from that corresponding row).

The conditions usually compare other columns of both tables, e.g. table_a.some_column = table_b.some_column. Of course, this query only makes sense if there is only one matching row in table_b.

Instructions 1/3

  • First, have a look at the current state of affiliations by fetching 10 rows and all columns.
-- Have a look at the 10 first rows of affiliations
SELECT * FROM affiliations
LIMIT 10;
In [55]:
stmt = "\
SELECT * FROM uni.affiliations \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
Out[55]:
[('Dimos', 'Poulikakos', 'VR-Mandat', 'Scrona AG', None),
 ('Francesco', 'Stellacci', 'Co-editor in Chief, Nanoscale', 'Royal Chemistry Society, UK', None),
 ('Alexander', 'Fust', 'Fachexperte und Coach für Designer Startups', 'Creative Hub', None),
 ('Jürgen', 'Brugger', 'Proposal reviewing HEPIA', 'HES Campus Biotech, Genève', None),
 ('Hervé', 'Bourlard', 'Director', 'Idiap Research Institute', None),
 ('Ioannis', 'Papadopoulos', 'Mandat', 'Schweizerischer Nationalfonds (SNF)', None),
 ('Olaf', 'Blanke', 'Professeur à 20%', 'Université de Genève', None),
 ('Leo', 'Staub', 'Präsident Verwaltungsrat', 'Genossenschaft Migros Ostschweiz', None),
 ('Pascal', 'Pichonnaz', 'Vize-Präsident', 'EKK (Eidgenössische Konsumenten Kommission)', None),
 ('Martin G.', 'Täuber', None, 'SWAN Hadron AG', None)]

Instructions 2/3

  • Update the professor_id column with the corresponding value of the id column in professors.
  • "Corresponding" means rows in professors where the firstname and lastname are identical to the ones in affiliations.
-- Set professor_id to professors.id where firstname, lastname correspond to rows in professors
UPDATE affiliations
SET professor_id = professors.id
FROM professors
WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname;
In [56]:
stmt = "\
UPDATE uni.affiliations \
SET professor_id = professors.id \
FROM uni.professors \
WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname; \
"
conn.execute(stmt)
Out[56]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0ea2f708>

Instructions 3/3

  • Check out the first 10 rows and all columns of affiliations again. Have the professor_ids been correctly matched?
-- Have a look at the 10 first rows of affiliations again
SELECT * FROM affiliations
LIMIT 10;
In [57]:
stmt = "\
SELECT * FROM uni.affiliations \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
Out[57]:
[('Nicolas', 'Mermod', 'Mandat', 'Selexis', 286),
 ('Christoph', 'Aebi', None, 'Ronald McDonald Stiftung', 399),
 ('Peter', 'Schneemann', None, 'CIHA', 383),
 ('Pius Eliseo', 'Baschera', 'VR-Präsidium', 'Hilti AG', 389),
 ('Dimos', 'Poulikakos', 'VR-Mandat', 'Scrona AG', 419),
 ('Francesco', 'Stellacci', 'Co-editor in Chief, Nanoscale', 'Royal Chemistry Society, UK', 82),
 ('Alexander', 'Fust', 'Fachexperte und Coach für Designer Startups', 'Creative Hub', 514),
 ('Jürgen', 'Brugger', 'Proposal reviewing HEPIA', 'HES Campus Biotech, Genève', 350),
 ('Hervé', 'Bourlard', 'Director', 'Idiap Research Institute', 23),
 ('Ioannis', 'Papadopoulos', 'Mandat', 'Schweizerischer Nationalfonds (SNF)', 193)]

As you can see, the correct professors.id has been inserted into professor_id for each record, thanks to the matching firstname and lastname in both tables.

Drop "firstname" and "lastname"

The firstname and lastname columns of affiliations were used to establish a link to the professors table in the last exercise – so the appropriate professor IDs could be copied over. This only worked because there is exactly one corresponding professor for each row in affiliations. In other words: {firstname, lastname} is a candidate key of professors – a unique combination of columns.

It isn't one in affiliations though, because, as said in the video, professors can have more than one affiliation.

Because professors are referenced by professor_id now, the firstname and lastname columns are no longer needed, so it's time to drop them. After all, one of the goals of a database is to reduce redundancy where possible.

Instructions

  • Drop the firstname and lastname columns from the affiliations table.
-- Drop the firstname & lastname column
ALTER TABLE affiliations
DROP COLUMN firstname;

ALTER TABLE affiliations
DROP COLUMN lastname;
In [58]:
stmt = "\
ALTER TABLE uni.affiliations \
DROP COLUMN firstname, \
DROP COLUMN lastname; \
"
conn.execute(stmt)
Out[58]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0ea2ed88>

Now the affiliations table that models the N:M-relationship between professors and organizations is finally complete.

Referential integrity

  • A record referencing another table must refer to an existing record in that table
    • A record in table A can't point to a record in table B that doesn't exist.
  • Referential integrity is a constraint that always concerns two tables, and is enforced through foreign keys, as you've seen in the previous lessons of this chapter.
  • Enforced through foreign keys. If you define a foreign key in the table professors referencing the table universities, referential integrity is held from professors to universities.

referential integrity violations

  • Referential integrity from table A to table B is violated...
    • If a record in table B that is already referenced from table A is deleted, you have a violation.
    • If you try to insert a record in table A that refers to something that does not exist in table B, you also violate the principle.
    • The main reason for Foreign keys is they prevent violations - they throw errors and stop you from accidentally doing these things.

Dealing with violations

CREATE TABLE a (
 id integer PRIMARY KEY,
 column_a varchar(64), 
 ...,
 b_id integer REFERENCES b (id) ON DELETE NO ACTION
);
  • Throwing errors is not the only option.
  • If you specify a foreign key on a column, you can tell the the database system what should happen if an entry in the referenced table is deleted.
  • By default, the ON DELETE NO ACTION keyword is automatically appended to a foreign key definition.
    • If you try to delete a record in table B, which is reference from table A, the system will throw an error.

Other options

CREATE TABLE a (
 id integer PRIMARY KEY,
 column_a varchar(64), 
 ...,
 b_id integer REFERENCES b (id) ON DELETE CASCADE
);
  1. There's the CASCADE option, which will first allow the deletion of the record in table B, and then will automatically delete all referencing records in table A. That deletion is cascaded and will delete all referencing records.
  2. The RESTRICT option is almost identical to the NO ACTION option. If will throw an error.
  3. The SET NULL option. It will set the value of the foreign key for this record to NULL.
  4. The SET DEFAULT option sets the referencing column to its default value if the referenced record is deleted and only works if a default value for the column has been specified.
  • Let's practice by changing the referential integrity behavior of the database.

Referential integrity violations

Given the current state of your database, what happens if you execute the following SQL statement?

DELETE FROM universities WHERE id = 'EPF';

Possible Answers

  • It throws an error because the university with ID "EPF" does not exist.
  • The university with ID "EPF" is deleted.
  • It fails because referential integrity from universities to professors is violated.
  • It fails because referential integrity from professors to universities is violated.

You defined a foreign key on professors.university_id that references universities.id, so referential integrity is said to hold from professors to universities.

Change the referential integrity behavior of a key

So far, you implemented three foreign key constraints:

  1. professors.university_id to universities.id
  2. affiliations.organization_id to organizations.id
  3. affiliations.professor_id to professors.id

These foreign keys currently have the behavior ON DELETE NO ACTION. Here, you're going to change that behavior for the column referencing organizations from affiliations. If an organization is deleted, all its affiliations (by any professor) should also be deleted.

Altering a key constraint doesn't work with ALTER COLUMN. Instead, you have to delete the key constraint and then add a new one with a different ON DELETE behavior.

For deleting constraints, though, you need to know their name. This information is also stored in information_schema.

Instructions 1/4

  • Have a look at the existing foreign key constraints by querying table_constraints in information_schema.
-- Identify the correct constraint name
SELECT constraint_name, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';
In [59]:
stmt = "\
SELECT constraint_name, table_name, constraint_type \
FROM information_schema.table_constraints \
WHERE constraint_type = 'FOREIGN KEY'; \
"
res = conn.execute(stmt)
res.fetchall()
Out[59]:
[('professors_fkey', 'professors', 'FOREIGN KEY'),
 ('affiliations_professor_id_fkey', 'affiliations', 'FOREIGN KEY'),
 ('affiliations_organization_fkey', 'affiliations', 'FOREIGN KEY')]

Instructions 2/4

  • Delete the affiliations_organization_id_fkey foreign key constraint in affiliations.
-- Drop the right foreign key constraint
ALTER TABLE affiliations
DROP CONSTRAINT affiliations_organization_id_fkey;
In [60]:
stmt = "\
ALTER TABLE uni.affiliations \
DROP CONSTRAINT affiliations_organization_fkey; \
"
conn.execute(stmt)
Out[60]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0eb7d108>

Instructions 3/4

  • Add a new foreign key to affiliations that cascades deletion if a referenced record is deleted from organizations. Name it affiliations_organization_id_fkey.
-- Add a new foreign key constraint from affiliations to organizations which cascades deletion
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE;
In [61]:
stmt = "\
ALTER TABLE uni.affiliations \
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES uni.organizations (id) ON DELETE CASCADE; \
"
conn.execute(stmt)
Out[61]:
<sqlalchemy.engine.result.ResultProxy at 0x2ed0eb7d088>

Instructions 4/4

  • Run the DELETE and SELECT queries to double check that the deletion cascade actually works.
-- Delete an organization 
DELETE FROM organizations 
WHERE id = 'CUREM';

-- Check that no more affiliations with this organization exist
SELECT * FROM affiliations
WHERE organization_id = 'CUREM';
In [62]:
stmt = "\
DELETE FROM uni.organizations \
WHERE id = 'CUREM'; \
"
conn.execute(stmt)

stmt = "\
SELECT * FROM uni.affiliations \
WHERE organization_id = 'CUREM'; \
"
res = conn.execute(stmt)
res.fetchall()
Out[62]:
[]

As you can see, whenever an organization referenced by an affiliation is deleted, the affiliations also gets deleted. It is your job as database designer to judge whether this is a sensible configuration. Sometimes, setting values to NULL or to restrict deletion altogether might make more sense!

Roundup

How the database has been transformed

initial_db

  • Started with a simple table with a lot of redundancy.
  • You might be used to such tables when working with flat files like CSVs or Excel files.
  • The table has undergone a step-by-step transformation into the following database schema - only by executing SQL queries.

final_model_2

  • Column data types have been defined, primary and foreign keys added, and through that, specified relationships between tables.
  • All these measures will guarantee better data consistency and therefore quality.
  • This is especially helpful if you add new data to the database, but also makes analyzing the data easier.

The database ecosystem

ecosystem

  • To go further from here, it's useful to look at the bigger picture for a minute.
  • In this course, a database has been transformed.
    • That was done with PostgreSQL, which is also call a Database Management System or DBMS.
    • The DBMS and one or more databases together form the Database System.
    • The DBMS exposes a query interface where you can run ad-hoc analyses and queries with SQL.
    • You can also access this interface through other client applications.
    • You could, for example, program a Python script that connects to the database, loads data from it, and visualizes it.
  • In the remainder of this course, you'll no longer manipulate data in your database system, but employ some analysis queries on your database.
    • This will be a quick repetition of what you've learned in previous SQL courses such as Into to SQL for Data Science, but it will also demonstrate the advantages of having a database instead of a flat file in the first place.

Count affiliations per university

Now that your data is ready for analysis, let's run some exemplary SQL queries on the database. You'll now use already known concepts such as grouping by columns and joining tables.

In this exercise, you will find out which university has the most affiliations (through its professors). For that, you need both affiliations and professors tables, as the latter also holds the university_id.

As a quick repetition, remember that joins have the following structure:

SELECT table_a.column1, table_a.column2, table_b.column1, ... 
FROM table_a
JOIN table_b 
ON table_a.column = table_b.column

This results in a combination of table_a and table_b, but only with rows where table_a.column is equal to table_b.column.

Instructions

  • Count the number of total affiliations by university.
  • Sort the result by that count, in descending order.
-- Count the total number of affiliations per university
SELECT COUNT(*), professors.university_id 
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
-- Group by the ids of professors
GROUP BY professors.id 
ORDER BY count DESC;
In [63]:
stmt = "\
SELECT COUNT(*), professors.university_id \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
GROUP BY professors.university_id \
ORDER BY count DESC LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
Out[63]:
[(579, 'EPF'),
 (272, 'USG'),
 (162, 'UBE'),
 (133, 'ETH'),
 (75, 'UBA'),
 (40, 'UFR'),
 (36, 'UNE'),
 (35, 'ULA'),
 (33, 'UGE'),
 (7, 'UZH')]

As you can see, the count of affiliations is completely different from university to university.

Join all the tables together

In this last exercise, you will find the university city of the professor with the most affiliations in the sector "Media & communication".

For this, you need to join all the tables, group by a column, and then use selection criteria to get only the rows in the correct sector.

Instructions 1/3

  • Join all tables in the database (starting with affiliations, professors, organizations, and universities) and look at the result.
-- Join all tables
SELECT *
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id;
In [64]:
stmt = "\
SELECT * \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
LIMIT 10; \
"
res = conn.execute(stmt)
res.fetchall()
Out[64]:
[('Mandat', 'Selexis', 286, 'Nicolas', 'Mermod', 'ULA', 286, 'Selexis', 'Pharma & health', 'ULA', 'Uni Lausanne', 'Lausanne'),
 (None, 'Ronald McDonald Stiftung', 399, 'Christoph', 'Aebi', 'UBE', 399, 'Ronald McDonald Stiftung', 'Society, Social, Culture & Sports', 'UBE', 'Uni Bern', 'Bern'),
 (None, 'CIHA', 383, 'Peter', 'Schneemann', 'UBE', 383, 'CIHA', 'Not classifiable', 'UBE', 'Uni Bern', 'Bern'),
 ('VR-Präsidium', 'Hilti AG', 389, 'Pius Eliseo', 'Baschera', 'ETH', 389, 'Hilti AG', 'Industry, construction & agriculture', 'ETH', 'ETH Zürich', 'Zurich'),
 ('VR-Mandat', 'Scrona AG', 419, 'Dimos', 'Poulikakos', 'ETH', 419, 'Scrona AG', 'Industry, construction & agriculture', 'ETH', 'ETH Zürich', 'Zurich'),
 ('Co-editor in Chief, Nanoscale', 'Royal Chemistry Society, UK', 82, 'Francesco', 'Stellacci', 'EPF', 82, 'Royal Chemistry Society, UK', 'Education & research', 'EPF', 'ETH Lausanne', 'Lausanne'),
 ('Fachexperte und Coach für Designer Startups', 'Creative Hub', 514, 'Alexander', 'Fust', 'USG', 514, 'Creative Hub', 'Commerce, trade & services', 'USG', 'Uni St. Gallen', 'Saint Gallen'),
 ('Proposal reviewing HEPIA', 'HES Campus Biotech, Genève', 350, 'Jürgen', 'Brugger', 'EPF', 350, 'HES Campus Biotech, Genève', 'Education & research', 'EPF', 'ETH Lausanne', 'Lausanne'),
 ('Director', 'Idiap Research Institute', 23, 'Hervé', 'Bourlard', 'EPF', 23, 'Idiap Research Institute', 'Education & research', 'EPF', 'ETH Lausanne', 'Lausanne'),
 ('Mandat', 'Schweizerischer Nationalfonds (SNF)', 193, 'Ioannis', 'Papadopoulos', 'ULA', 193, 'Schweizerischer Nationalfonds (SNF)', 'Education & research', 'ULA', 'Uni Lausanne', 'Lausanne')]
In [65]:
res = conn.execute(stmt).fetchall()
# Create a DataFrame from the results: df
df = pd.DataFrame(res)

# Set column names
df.columns = res[0].keys()

df.head()
Out[65]:
function organization_id professor_id firstname lastname university_id id id organization_sector id university university_city
0 Mandat Selexis 286 Nicolas Mermod ULA 286 Selexis Pharma & health ULA Uni Lausanne Lausanne
1 None Ronald McDonald Stiftung 399 Christoph Aebi UBE 399 Ronald McDonald Stiftung Society, Social, Culture & Sports UBE Uni Bern Bern
2 None CIHA 383 Peter Schneemann UBE 383 CIHA Not classifiable UBE Uni Bern Bern
3 VR-Präsidium Hilti AG 389 Pius Eliseo Baschera ETH 389 Hilti AG Industry, construction & agriculture ETH ETH Zürich Zurich
4 VR-Mandat Scrona AG 419 Dimos Poulikakos ETH 419 Scrona AG Industry, construction & agriculture ETH ETH Zürich Zurich

Instructions 2/3

  • Now group the result by organization sector, professor, and university city.
  • Count the resulting number of rows.
-- Group the table by organization sector, professor and university city
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
GROUP BY organizations.organization_sector, professors.id, universities.university_city;
In [66]:
stmt = "\
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
GROUP BY organizations.organization_sector, professors.id, universities.university_city \
LIMIT 10; \
"

res = conn.execute(stmt).fetchall()
df = pd.DataFrame(res)
df.columns = res[0].keys()
df
Out[66]:
count organization_sector id university_city
0 3 Politics, administration, justice system & sec... 502 Lausanne
1 1 Technology 439 Lausanne
2 3 Education & research 253 Saint Gallen
3 2 Education & research 165 Lausanne
4 1 Financial industry & insurances 523 Bern
5 1 Consulting, public relations, legal & trust 210 Lausanne
6 4 Education & research 356 Lausanne
7 1 Education & research 520 Lausanne
8 1 Education & research 168 Zurich
9 1 Consulting, public relations, legal & trust 287 Lausanne

Instructions 3/3

  • Only retain rows with "Media & communication" as organization sector, and sort the table by count, in descending order.
-- Filter the table and sort it
SELECT COUNT(*), organizations.organization_sector, 
professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
WHERE organizations.organization_sector = 'Media & communication'
GROUP BY organizations.organization_sector, 
professors.id, universities.university_city
ORDER BY count DESC;
In [67]:
stmt = "\
SELECT COUNT(*), organizations.organization_sector, professors.id, universities.university_city \
FROM uni.affiliations \
JOIN uni.professors \
ON affiliations.professor_id = professors.id \
JOIN uni.organizations \
ON affiliations.organization_id = organizations.id \
JOIN uni.universities \
ON professors.university_id = universities.id \
WHERE organizations.organization_sector = 'Media & communication' \
GROUP BY organizations.organization_sector, professors.id, universities.university_city \
ORDER BY count DESC LIMIT 10; \
"

res = conn.execute(stmt).fetchall()
df = pd.DataFrame(res)
df.columns = res[0].keys()
df
Out[67]:
count organization_sector id university_city
0 4 Media & communication 345 Lausanne
1 3 Media & communication 253 Saint Gallen
2 3 Media & communication 322 Lausanne
3 2 Media & communication 378 Lausanne
4 2 Media & communication 453 Saint Gallen
5 2 Media & communication 452 Zurich
6 2 Media & communication 380 Saint Gallen
7 2 Media & communication 439 Lausanne
8 2 Media & communication 290 Saint Gallen
9 1 Media & communication 334 Basel

The professor with id 538 (different for the downloaded dataset) has the most affiliations in the "Media & communication" sector, and he or she lives in the city of Lausanne. Thanks to your database design, you can be sure that the data you've just queried is consistent. Of course, you could also put university_city and organization_sector in their own tables, making the data model even more formal. However, in database design, you have to strike a balance between modeling overhead, desired data consistency, and usability for queries like the one you've just wrote. Congratulations, you made it to the end!

In [68]:
conn.close()

Certificate