- Course: DataCamp: Introduction to Databases in Python
- This notebook was created as a reproducible reference.
- Most of the material is from the course; however, code updates have been introduced for compatibility with SQLAlchemy v2.0.29.
- I completed the exercises.
- If you find the content beneficial, consider a DataCamp Subscription.
- Tested with:
- pandas version: 2.2.1
- matplotlib version: 3.8.4
- SQLAlchemy version: 2.0.29
- PyMySQL version: 1.0.2
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine, Table, MetaData, select, and_, desc, func, case, cast, Float
from sqlalchemy import Column, String, Integer, Boolean, insert, update, delete, inspect, text
from sqlalchemy.orm import Session
import pymysql
from pprint import pprint as pp
import csv
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
Data Files Location
Data File Objects
census_csv_data = 'data/intro_to_databases_in_python/census.csv'
census_sql_data = 'sqlite:///data/intro_to_databases_in_python/census.sqlite'
employees_sql_data = 'sqlite:///data/intro_to_databases_in_python/employees.sqlite'
Introduction to Databases in Python
Summary
- Introduction
- Discuss the role of database management in modern software development.
- Introduce SQLAlchemy as a powerful tool for database interactions in Python.
- Setting Up Your Environment
- Explanation of how to set up a local SQLite database for development and testing purposes.
- Core Functionalities of SQLAlchemy
- Detailed instructions on creating an engine to connect to various types of databases.
- Examples of executing raw SQL statements through SQLAlchemy to perform database operations.
- Using SQLAlchemy ORM (Object-Relational Mapping)
- Definition and benefits of using ORM over traditional SQL queries.
- Guide to defining database schemas with Python classes and mapping them to database tables.
- Techniques for manipulating data in tables using ORM, including adding, deleting, and updating records.
- Session Management and Transactions
- Explanation of session management in SQLAlchemy for handling database operations.
- Detailed process for adding records to the database, committing transactions to save changes, and rolling back transactions in case of errors.
- Querying Data
- Introduction to constructing queries using SQLAlchemy ORM, focusing on simplicity and power.
- Advanced querying techniques, including joining tables, filtering results, and using subqueries for complex data retrieval.
- Integrating SQLAlchemy with SQLite
- Configuration steps to integrate SQLAlchemy with SQLite, a lightweight relational database.
- Practical examples and code snippets for performing create, read, update, and delete (CRUD) operations in SQLite using SQLAlchemy.
Course Description
In this Python SQL course, you’ll learn the basics of using Structured Query Language (SQL) with Python. This will be useful since whether you like it or not, databases are ubiquitous and, as a data scientist, you’ll need to interact with them constantly. The Python SQL toolkit SQLAlchemy provides an accessible and intuitive way to query, build & write to SQLite, MySQL and Postgresql databases (among many others), all of which you will encounter in the daily life of a data scientist.
Basics of Relational Databases
In this chapter, you will become acquainted with the fundamentals of Relational Databases and the Relational Model. You will learn how to connect to a database and then interact with it by writing basic SQL queries, both in raw SQL as well as with SQLAlchemy, which provides a Pythonic way of interacting with databases.
Introduction to Databases
A database consists of tables
Table consists of columns and rows
Exercises
Relational Model
Which of the following is not part of the relational model?
Answer the question
- Tables
- Columns
- Rows
- Dimensions
- Relationships
Connecting to a Database
Meet SQLAlchemy
- Two Main Pieces
- Core (Relational Model focused)
- ORM (User Data Model focused)
There are many types of databases
- SQLite
- PostgreSQL
- MySQL
- MS SQL
- Oracle
- Many more
Connecting to a database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///census_nyc.sqlite')
connection = engine.connect()
- Engine: common interface to the database from SQLAlchemy
- Connection string: All the details required to find the database (and login, if necessary)
A word on connection strings
- 'sqlite:///census_nyc.sqlite'
- Driver+Dialect ('sqlite'), then ':///', then the filename ('census_nyc.sqlite')
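If it helps, the parts of a connection string can be inspected programmatically with SQLAlchemy's make_url helper; a small sketch:

```python
from sqlalchemy.engine import make_url

# Parse a connection string into its components.
url = make_url('sqlite:///census_nyc.sqlite')
print(url.drivername)  # sqlite
print(url.database)    # census_nyc.sqlite
```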
What’s in your database?
- Before querying your database, you’ll want to know what is in it: what the tables are, for example:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///census_nyc.sqlite')
print(engine.table_names())  # removed in SQLAlchemy 2.0; use inspect(engine).get_table_names()
Out: ['census', 'state_fact']
Reflection
- Reflection reads database and builds SQLAlchemy Table objects
from sqlalchemy import MetaData, Table
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
print(repr(census))
Out:
Table('census', MetaData(bind=None), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)
Exercises
Engines and Connection Strings
Alright, it’s time to create your first engine! An engine is just a common interface to a database, and the information it requires to connect to one is contained in a connection string, such as sqlite:///census_nyc.sqlite. Here, sqlite is the database driver, while census_nyc.sqlite is a SQLite file contained in the local directory.
You can learn a lot more about connection strings in the SQLAlchemy documentation.
Your job in this exercise is to create an engine that connects to a local SQLite file named census.sqlite. Then, print the names of the tables it contains using the .table_names() method. Note that when you just want to print the table names, you do not need to use engine.connect() after creating the engine.
Instructions
- Import create_engine from the sqlalchemy module.
- Using the create_engine() function, create an engine for a local file named census.sqlite with sqlite as the driver. Be sure to enclose the connection string within quotation marks.
- Print the output from the .table_names() method on the engine.
# Import create_engine - at top of notebook
# Create an engine that connects to the census.sqlite file: engine
engine = create_engine(census_sql_data)
# Create an inspector object
inspector = inspect(engine)
# Use the inspector to list the table names
table_names = inspector.get_table_names()
print(table_names)
['census', 'state_fact']
Autoloading Tables from a Database
SQLAlchemy can be used to automatically load tables from a database using something called reflection. Reflection is the process of reading the database and building the metadata based on that information. It’s the opposite of creating a Table by hand and is very useful for working with existing databases. To perform reflection, you need to import the Table object from the SQLAlchemy package. Then, you use this Table object to read your table from the engine and autoload the columns. Using the Table object in this manner is a lot like passing arguments to a function. For example, to autoload the columns with the engine, you have to specify the keyword arguments autoload=True and autoload_with=engine to Table().
In this exercise, your job is to reflect the census table available on your engine into a variable called census. The metadata has already been loaded for you using MetaData() and is available in the variable metadata.
Instructions
- Import the Table object from sqlalchemy.
- Reflect the census table by using the Table object with the arguments:
- The name of the table as a string ('census').
- The metadata, contained in the variable metadata.
- autoload=True
- The engine to autoload with - in this case, engine.
- Print the details of census using the repr() function.
2024-04-19 Update Notes
The error you’re seeing in the updated trace is due to the use of deprecated arguments with the Table constructor in SQLAlchemy version 1.4 and later. The autoload and autoload_with parameters have been replaced by a different mechanism for reflecting tables from a database.
To fix the error, you should use the autoload_with parameter in a different way along with SQLAlchemy’s reflection system.
This code first reflects all tables from the connected database into the metadata object with metadata.reflect(engine). Then, it retrieves a specific table (in this case, 'census') without using the now-deprecated autoload parameter. This approach uses the recommended pattern for SQLAlchemy version 1.4 and later, providing compatibility and future-proofing your code.
# Import Table - at top of Notebook
# Create a MetaData instance
metadata = MetaData()
# Reflect an existing table by using the Table constructor and MetaData.reflect
metadata.reflect(engine)
census = Table('census', metadata, autoload_with=engine)
# Now you can print out the structure of the table
print(repr(census))
Table('census', MetaData(), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)
Viewing Table Details
Great job reflecting the census table! Now you can begin to learn more about the columns and structure of your table. It is important to get an understanding of your database by examining the column names. This can be done by using the .columns attribute and accessing the .keys() method. For example, census.columns.keys() would return a list of column names of the census table.
Following this, we can use the metadata container to find out more details about the reflected table such as the columns and their types. For example, table objects are stored in the metadata.tables dictionary, so you can get the metadata of your census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table from the previous exercise.
Instructions
- Reflect the census table as you did in the previous exercise using the Table() function.
- Print a list of column names of the census table by applying the .keys() method to census.columns.
- Print the details of the census table using the metadata.tables dictionary along with the repr() function. To do this, first access the 'census' key of the metadata.tables dictionary, and place this inside the provided repr() function.
# Reflect the census table from the engine: census
census = Table('census', metadata, autoload_with=engine)  # autoload=True was removed in SQLAlchemy 2.0
# Print the column names
census.columns.keys()
['state', 'sex', 'age', 'pop2000', 'pop2008']
# Print full table metadata
repr(metadata.tables['census'])
"Table('census', MetaData(), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)"
Introduction to SQL
SQL Statements
- Select, Insert, Update & Delete data
- Create & Alter data
Basic SQL querying
- SELECT column_name FROM table_name
- SELECT pop2008 FROM People
- SELECT * FROM People
Basic SQL querying
from sqlalchemy import create_engine
engine = create_engine('sqlite:///census_nyc.sqlite')
connection = engine.connect()
stmt = 'SELECT * FROM people'
result_proxy = connection.execute(stmt)  # SQLAlchemy 1.4+ requires wrapping raw SQL strings in text()
results = result_proxy.fetchall()
ResultProxy vs ResultSet
result_proxy = connection.execute(stmt)
results = result_proxy.fetchall()
Handling ResultSets
first_row = results[0]
print(first_row)
Out: ('Illinois', 'M', 0, 89600, 95012)
print(first_row.keys())
Out: ['state', 'sex', 'age', 'pop2000', 'pop2008']
print(first_row.state)
Out: 'Illinois'
SQLAlchemy to Build Queries
- Provides a Pythonic way to build SQL statements
- Hides differences between backend database types
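One way to see this backend independence is to compile the same statement against two different dialects. A sketch, using an illustrative table definition rather than the course database:

```python
from sqlalchemy import Table, Column, Integer, String, MetaData, select
from sqlalchemy.dialects import sqlite, mssql

metadata = MetaData()
people = Table('people', metadata,
               Column('name', String(30)),
               Column('age', Integer))

# One Python expression...
stmt = select(people).limit(5)

# ...renders differently per backend: SQLite uses LIMIT, SQL Server uses TOP.
print(stmt.compile(dialect=sqlite.dialect()))
print(stmt.compile(dialect=mssql.dialect()))
```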
SQLAlchemy querying
from sqlalchemy import Table, MetaData
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
stmt = select(census)
results = connection.execute(stmt).fetchall()
SQLAlchemy Select Statement
- Requires a list of one or more Tables or Columns
- Using a table will select all the columns in it
stmt = select(census)
print(stmt)
Out: 'SELECT * from CENSUS'
Exercises
Selecting data from a Table: raw SQL
Using what we just learned about SQL and applying the .execute() method on our connection, we can leverage a raw SQL query to query all the records in our census table. The object returned by the .execute() method is a ResultProxy. On this ResultProxy, we can then use the .fetchall() method to get our results - that is, the ResultSet.
In this exercise, you’ll use a traditional SQL query. In the next exercise, you’ll move to SQLAlchemy and begin to understand its advantages. Go for it!
Instructions
- Build a SQL statement to query all the columns from census and store it in stmt. Note that your SQL statement must be a string.
- Use the .execute() and .fetchall() methods on connection and store the result in results. Remember that .execute() comes before .fetchall() and that stmt needs to be passed to .execute().
- Print results.
2024-04-22 Update Notes
SQLAlchemy in its later versions (especially from version 1.4 onwards) requires that you wrap raw SQL strings in a text() construct to safely execute them. This is a part of SQLAlchemy’s security measures to help prevent SQL injection attacks.
In this revised code, text() is imported from sqlalchemy and is used to wrap the SQL statement. This ensures that SQLAlchemy treats the string as an SQL expression, allowing it to be executed on the database. This method is safe and recommended when executing raw SQL queries with SQLAlchemy.
engine = create_engine(census_sql_data)
connection = engine.connect()
# Build select statement for census table: stmt
stmt = text('SELECT * FROM census')
# Execute the statement and fetch the results: results
results = connection.execute(stmt).fetchall()
results[:5]
[('Illinois', 'M', 0, 89600, 95012),
('Illinois', 'M', 1, 88445, 91829),
('Illinois', 'M', 2, 88729, 89547),
('Illinois', 'M', 3, 88868, 90037),
('Illinois', 'M', 4, 91947, 91111)]
Selecting data from a Table with SQLAlchemy
It’s now time to build your first select statement using SQLAlchemy. SQLAlchemy provides a nice “Pythonic” way of interacting with databases. So rather than dealing with the differences between specific dialects of traditional SQL such as MySQL or PostgreSQL, you can leverage the Pythonic framework of SQLAlchemy to streamline your workflow and more efficiently query your data. For this reason, it is worth learning even if you may already be familiar with traditional SQL.
In this exercise, you’ll once again build a statement to query all records from the census table. This time, however, you’ll make use of the select() function of the sqlalchemy module. This function requires a list of tables or columns as the only required argument.
Table and MetaData have already been imported. The metadata is available as metadata and the connection to the database as connection.
Instructions
- Import select from the sqlalchemy module.
- Reflect the census table. This code is already written for you.
- Create a query using the select() function to retrieve the census table. To do so, pass a list to select() containing a single element: census.
- Print stmt to see the actual SQL query being created. This code has been written for you.
- Using the provided print() function, print all the records from the census table. To do this:
- Use the .execute() method on connection with stmt as the argument to retrieve the ResultProxy.
- Use .fetchall() on connection.execute(stmt) to retrieve the ResultSet.
2024-04-22 Update Note
The select function from SQLAlchemy no longer accepts a list of tables or columns as an argument inside a list. This syntax change was introduced in SQLAlchemy 1.4 to enforce more explicit construction patterns and improve security against SQL injection attacks.
This change reflects the updated usage of the select function where you pass the table or columns directly, without enclosing them in brackets. This update in your code will conform to the latest SQLAlchemy syntax and should run without errors.
# Import select - at top of Notebook
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(engine)
# Access the 'census' table
census = metadata.tables['census']
# Build select statement for census table: stmt
stmt = select(census)
# Print the statement to see the SQL emitted
print(stmt)
SELECT census.state, census.sex, census.age, census.pop2000, census.pop2008
FROM census
# Execute the statement and print the results
results = connection.execute(stmt).fetchall()
results[:5]
[('Illinois', 'M', 0, 89600, 95012),
('Illinois', 'M', 1, 88445, 91829),
('Illinois', 'M', 2, 88729, 89547),
('Illinois', 'M', 3, 88868, 90037),
('Illinois', 'M', 4, 91947, 91111)]
Handling a ResultSet
Recall the differences between a ResultProxy and a ResultSet:
- ResultProxy: The object returned by the .execute() method. It can be used in a variety of ways to get the data returned by the query.
- ResultSet: The actual data asked for in the query when using a fetch method such as .fetchall() on a ResultProxy.
This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.
Once we have a ResultSet, we can use Python to access all the data within it by column name and by list-style indexes. For example, you can get the first row of the results by using results[0]. With that first row then assigned to a variable first_row, you can get data from the first column by either using first_row[0] or by column name such as first_row['column_name']. You’ll now practice exactly this using the ResultSet you obtained from the census table in the previous exercise. It is stored in the variable results. Enjoy!
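One quick way to see this granularity is with the different fetch methods; a minimal sketch against a throwaway in-memory SQLite table (the table and rows here are illustrative, not the course's census data):

```python
from sqlalchemy import create_engine, text

# In-memory database with a tiny illustrative table.
engine = create_engine('sqlite:///:memory:')
with engine.connect() as connection:
    connection.execute(text('CREATE TABLE people (name TEXT, age INTEGER)'))
    connection.execute(text("INSERT INTO people VALUES ('Ann', 30), ('Bo', 41), ('Cy', 25)"))

    result_proxy = connection.execute(text('SELECT * FROM people'))
    first = result_proxy.fetchone()       # a single Row
    next_two = result_proxy.fetchmany(2)  # a list of up to 2 Rows
    rest = result_proxy.fetchall()        # whatever remains (here: nothing)
    print(first, len(next_two), len(rest))
```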
Instructions
- Extract the first row of results and assign it to the variable first_row.
- Print the value of the first column in first_row.
- Print the value of the ‘state’ column in first_row.
# Get the first row of the results by using an index: first_row
first_row = results[0]
# Print the first row of the results
print(first_row)
# Print the first column of the first row by using an index
print(first_row[0])
# Print the 'state' column of the first row by using its name
# print(first_row['state']) # string keys no longer work on Row objects in SQLAlchemy 2.0;
# use first_row._mapping['state'] or attribute access instead
print(first_row.state)
('Illinois', 'M', 0, 89600, 95012)
Illinois
Illinois
# Create an SQLAlchemy Engine instance. This engine manages connections to the database.
# The `census_sql_data` should be a database URL that specifies database dialect and connection arguments.
engine = create_engine(census_sql_data)
# Create a session. This session establishes and maintains all conversations with the database.
# It represents a 'holding zone' for all the objects which you've loaded or associated with it during its lifespan.
session = Session(bind=engine)
# Create a MetaData instance. MetaData is a container object that keeps together many different features of a database (or multiple databases).
metadata = MetaData()
# Reflect the database schema into MetaData. This loads table definitions from the database automatically.
# The `bind=engine` argument tells MetaData which engine to use for connection.
metadata.reflect(bind=engine)
# Access the 'census' table object from the metadata. This dictionary-style access allows you to get a Table object.
# Each Table object is a member of the MetaData collection.
census = metadata.tables['census']
# Build a SELECT statement. `select(census)` constructs a simple query that selects all columns from the 'census' table.
stmt = select(census)
# Execute the SELECT statement using the session. This sends the SQL statement to the database and returns a result object.
result = session.execute(stmt)
# Fetch the first row of the result. `fetchone()` retrieves the next row of a query result set, returning a single sequence, or None if no more rows are available.
first_row = result.fetchone()
# Print the type of the first row to verify it's an instance of `Row`, which allows both indexed and keyed access.
print(type(first_row))
# Print the column names available in the result. This helps verify the structure of the returned rows and what columns can be accessed.
print(result.keys())
# Print the first row to see the data that has been fetched.
print(first_row)
# Print the value of the 'state' column accessed using attribute access. This is an alternative access method provided by SQLAlchemy for convenience.
print(first_row.state)
<class 'sqlalchemy.engine.row.Row'>
RMKeyView(['state', 'sex', 'age', 'pop2000', 'pop2008'])
('Illinois', 'M', 0, 89600, 95012)
Illinois
Coming up Next…
- Beef up your SQL querying skills
- Learn how to extract all types of useful information from your databases using SQLAlchemy
- Learn how to create and write to relational databases
- Deep dive into the US census dataset
Applying Filtering, Ordering and Grouping to Queries
In this chapter, you will build on the database knowledge you began acquiring in the previous chapter by writing more nuanced queries that allow you to filter, order, and count your data, all within the Pythonic framework provided by SQLAlchemy!
Filtering and Targeted Data
Where Clauses
stmt = select(census)
stmt = stmt.where(census.columns.state == 'California')
results = connection.execute(stmt).fetchall()
for result in results:
    print(result.state, result.age)
Out:
California 0
California 1
California 2
California 3
California 4
...
- Restrict data returned by a query based on boolean conditions
- Compare a column against a value or another column
- Commonly used comparisons: ==, <=, >=, !=
Expressions
- Provide more complex conditions than simple operators
- E.g. in_(), like(), between()
- Many more in documentation
- Available as methods on a Column
stmt = select(census)
stmt = stmt.where(census.columns.state.startswith('New'))
for result in connection.execute(stmt):
    print(result.state, result.pop2000)
Out:
New Jersey 56983
New Jersey 56686
New Jersey 57011
...
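The other expressions listed above work the same way; a minimal sketch of between() and like() against an illustrative in-memory table (not the course database):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select, insert

# Illustrative in-memory table standing in for the course's census data.
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)),
               Column('age', Integer))
metadata.create_all(engine)

with engine.connect() as connection:
    connection.execute(insert(census), [
        {'state': 'New York', 'age': 18},
        {'state': 'Ohio', 'age': 40},
        {'state': 'New Jersey', 'age': 21},
    ])

    # between(): inclusive range comparison on a column
    stmt = select(census).where(census.columns.age.between(18, 21))
    young = connection.execute(stmt).fetchall()

    # like(): SQL pattern matching with % wildcards
    stmt = select(census).where(census.columns.state.like('New%'))
    new_states = connection.execute(stmt).fetchall()

print(len(young), len(new_states))
```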
Conjunctions
- Allow us to have multiple criteria in a where clause
- E.g. and_(), not_(), or_()
from sqlalchemy import or_
stmt = select(census)
stmt = stmt.where(or_(census.columns.state == 'California',
                      census.columns.state == 'New York'))
for result in connection.execute(stmt):
    print(result.state, result.sex)
Out:
New York M
...
California F
Exercises
Connecting to a PostgreSQL Database
In these exercises, you will be working with real databases hosted on the cloud via Amazon Web Services (AWS)!
Let’s begin by connecting to a PostgreSQL database. When connecting to a PostgreSQL database, many prefer to use the psycopg2 database driver as it supports practically all of PostgreSQL’s features efficiently and is the standard dialect for PostgreSQL in SQLAlchemy.
You might recall from Chapter 1 that we use the create_engine() function and a connection string to connect to a database.
There are four components to the connection string in this exercise: the dialect and driver ('postgresql+psycopg2://'), followed by the username and password ('student:datacamp'), followed by the host and port ('@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/'), and finally, the database name ('census'). You will have to pass this string as an argument to create_engine() in order to connect to the database.
Instructions
- Import create_engine from sqlalchemy.
- Create an engine to the census database by concatenating the following strings:
- 'postgresql+psycopg2://'
- 'student:datacamp'
- '@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com'
- ':5432/census'
- Use the .table_names() method on engine to print the table names.
# Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
# Create an inspector object from the engine
inspector = inspect(engine)
# Use the inspector to list the table names
table_names = inspector.get_table_names()
print(table_names)
['census', 'state_fact']
# Create an engine to the census database - exercise
# engine = create_engine('postgresql+psycopg2://student:datacamp@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/census')
# Use the .table_names() method on the engine to print the table names
# print(engine.table_names()) # this no longer works
Filter data selected from a Table - Simple
Having connected to the database, it’s now time to practice filtering your queries!
As mentioned in the video, a where() clause is used to filter the data that a statement returns. For example, to select all the records from the census table where the sex is Female (or ‘F’) we would do the following:
select(census).where(census.columns.sex == 'F')
In addition to == we can use basically any python comparison operator (such as <=, !=, etc) in the where() clause.
Instructions
- Select all records from the census table by passing in census as a list to select().
- Append a where clause to stmt to return only the records with a state of ‘New York’.
- Execute the statement stmt using .execute() and retrieve the results using .fetchall().
- Iterate over results and print the age, sex and pop2008 columns from each record. For example, you can print out the age of result with result.age.
# Create a select query: stmt
stmt = select(census)
# Add a where clause to filter the results to only those for New York
stmt = stmt.where(census.columns.state == 'New York')
# Execute the query to retrieve all the data returned: results
results = connection.execute(stmt).fetchall()
# Loop over the results and print the age, sex, and pop2008
for i, result in enumerate(results):
    if i < 7:
        print(result.age, result.sex, result.pop2008)
0 M 128088
1 M 125649
2 M 121615
3 M 120580
4 M 122482
5 M 121205
6 M 120089
[('New York', 'M', 0, 126237, 128088),
('New York', 'M', 1, 124008, 125649),
('New York', 'M', 2, 124725, 121615),
('New York', 'M', 3, 126697, 120580),
('New York', 'M', 4, 131357, 122482),
('New York', 'M', 5, 133095, 121205),
('New York', 'M', 6, 134203, 120089)]
Filter data selected from a Table - Expressions
In addition to standard Python comparators, we can also use methods such as in_() to create more powerful where() clauses. You can see a full list of expressions in the SQLAlchemy Documentation.
We’ve already created a list of some of the most densely populated states.
Instructions
- Select all records from the census table by passing it in as a list to select().
- Append a where clause to return all the records with a state in the states list. Use in_(states) on census.columns.state to do this.
- Loop over the ResultProxy connection.execute(stmt) and print the state and pop2000 columns from each record.
stmt = stmt.where(census.columns.state.startswith('New'))
for result in connection.execute(stmt):
    print(result.state, result.pop2000)
| New York 126237
New York 124008
New York 124725
New York 126697
New York 131357
New York 133095
New York 134203
New York 137986
New York 139455
New York 142454
New York 145621
New York 138746
New York 135565
New York 132288
New York 132388
New York 131959
New York 130189
New York 132566
New York 132672
New York 133654
New York 132121
New York 126166
New York 123215
New York 121282
New York 118953
New York 123151
New York 118727
New York 122359
New York 128651
New York 140687
New York 149558
New York 139477
New York 138911
New York 139031
New York 145440
New York 156168
New York 153840
New York 152078
New York 150765
New York 152606
New York 159345
New York 148628
New York 147892
New York 144195
New York 139354
New York 141953
New York 131875
New York 128767
New York 125406
New York 124155
New York 125955
New York 118542
New York 118532
New York 124418
New York 95025
New York 92652
New York 90096
New York 95340
New York 83273
New York 77213
New York 77054
New York 72212
New York 70967
New York 66461
New York 64361
New York 64385
New York 58819
New York 58176
New York 57310
New York 57057
New York 57761
New York 53775
New York 53568
New York 51263
New York 48440
New York 46702
New York 43508
New York 40730
New York 37950
New York 35774
New York 32453
New York 26803
New York 25041
New York 21687
New York 18873
New York 88366
New York 120355
New York 118219
New York 119577
New York 121029
New York 125247
New York 128227
New York 128428
New York 131161
New York 133646
New York 135746
New York 138287
New York 131904
New York 129028
New York 126571
New York 125682
New York 125409
New York 122770
New York 123978
New York 125307
New York 127956
New York 129184
New York 124575
New York 123701
New York 124108
New York 122624
New York 127474
New York 123033
New York 128125
New York 134795
New York 146832
New York 152973
New York 144001
New York 143930
New York 144653
New York 151147
New York 159228
New York 159999
New York 157911
New York 156103
New York 159284
New York 163331
New York 155353
New York 153688
New York 151615
New York 146774
New York 148318
New York 139802
New York 138062
New York 134107
New York 134399
New York 136630
New York 130843
New York 130196
New York 136064
New York 106579
New York 104847
New York 101857
New York 108406
New York 94346
New York 88584
New York 88932
New York 82899
New York 82172
New York 77171
New York 76032
New York 76498
New York 70465
New York 71088
New York 70847
New York 71377
New York 74378
New York 70611
New York 70513
New York 69156
New York 68042
New York 68410
New York 64971
New York 61287
New York 58911
New York 56865
New York 54553
New York 46381
New York 45599
New York 40525
New York 37436
New York 226378
states = ['New York', 'California', 'Texas']
# Create a query for the census table: stmt
stmt = select(census)
# Append a where clause to match all the states in_ the list states
stmt = stmt.where(census.columns.state.in_(states))
# Loop over the ResultProxy and print the state and its population in 2000
for i, result in enumerate(connection.execute(stmt)):
    if i < 7:
        print(result.state, result.pop2000)
New York 126237
New York 124008
New York 124725
New York 126697
New York 131357
New York 133095
New York 134203
Filter data selected from a Table - Advanced
You’re really getting the hang of this! SQLAlchemy also allows users to use conjunctions such as and_(), or_(), and not_() to build more complex filtering. For example, we can get a set of records for people in New York who are 21 or 37 years old with the following code:
1
2
3
| select(census).where(and_(census.columns.state == 'New York',
or_(census.columns.age == 21,
census.columns.age == 37)))
|
Instructions
- Import and_ from the sqlalchemy module.
- Select all records from the census table.
- Append a where clause to filter all the records whose state is ‘California’, and whose sex is not ‘M’.
- Iterate over the ResultProxy and print the age and sex columns from each record.
1
2
3
4
5
6
7
8
9
10
11
12
| # Build a query for the census table: stmt
stmt = select(census)
# Append a where clause to select only non-male records from California using and_
# The state of California with a non-male sex
stmt = stmt.where(and_(census.columns.state == 'California',
census.columns.sex != 'M'))
# Loop over the ResultProxy printing the age and sex
for i, result in enumerate(connection.execute(stmt)):
if i < 7:
print(result.age, result.sex)
|
1
2
3
4
5
6
7
| 0 F
1 F
2 F
3 F
4 F
5 F
6 F
|
Ordering Query Results
Order by Clauses
- Allows us to control the order in which records are returned in the query results
- Available as a method on statements order_by()
1
2
3
4
5
6
7
8
stmt = select(census.columns.state)
results = connection.execute(stmt).fetchall()
print(results[:10])
Out: [('Illinois',), …]
stmt = stmt.order_by(census.columns.state)
results = connection.execute(stmt).fetchall()
print(results[:10])
Out: [('Alabama',), …]
|
Order by Descending
- Wrap the column with desc() in the order_by() clause
Order by Multiple
- Just separate multiple columns with a comma
- Orders completely by the first column
- Then if there are duplicates in the first column, orders by the second column
- repeat until all columns are ordered
1
2
3
4
5
6
7
8
9
10
11
12
13
14
stmt = select(census.columns.state, census.columns.sex)
stmt = stmt.order_by(census.columns.state, census.columns.sex)
results = connection.execute(stmt).fetchall()
print(results)
Out:
('Alabama', 'F')
('Alabama', 'F')
…
('Alabama', 'M')
|
Exercises
1
2
3
4
5
6
7
8
| # Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(engine)
# Reflect the 'census' table via engine
census = Table('census', metadata, autoload_with=engine)
|
Ordering by a Single Column
To sort the result output by a field, we use the .order_by() method. By default, the .order_by() method sorts from lowest to highest on the supplied column. You just have to pass in the name of the column you want sorted to .order_by().
In the video, for example, Jason used stmt.order_by(census.columns.state) to sort the result output by the state column.
Instructions
- Select all records of the state column from the census table. To do this, pass census.columns.state as a list to select().
- Append an .order_by() to sort the result output by the state column.
- Execute stmt using the .execute() method on connection and retrieve all the results using .fetchall().
- Print the first 10 rows of results.
1
2
3
4
5
6
7
8
9
10
11
| # Build a query to select the state column: stmt
stmt = select(census.columns.state)
# Order stmt by the state column
stmt = stmt.order_by(census.columns.state)
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
# Print the first 10 results
results[:10]
|
1
2
3
4
5
6
7
8
9
10
| [('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',),
('Alabama',)]
|
Ordering in Descending Order by a Single Column
You can also use .order_by() to sort from highest to lowest by wrapping a column in the desc() function. Although you haven’t seen this function in action, it generalizes what you have already learned.
Pass desc() (for “descending”) inside an .order_by() with the name of the column you want to sort by. For instance, stmt.order_by(desc(table.columns.column_name)) sorts column_name in descending order.
Instructions
- Import desc from the sqlalchemy module.
- Select all records of the state column from the census table.
- Append an .order_by() to sort the result output by the state column in descending order. Save the result as rev_stmt.
- Execute rev_stmt using connection.execute() and fetch all the results with .fetchall(). Save them as rev_results.
- Print the first 10 rows of rev_results.
1
2
3
4
5
6
7
8
9
10
11
| # Build a query to select the state column: stmt
stmt = select(census.columns.state)
# Order stmt by state in descending order: rev_stmt
rev_stmt = stmt.order_by(desc(census.columns.state))
# Execute the query and store the results: rev_results
rev_results = connection.execute(rev_stmt).fetchall()
# Print the first 10 rev_results
rev_results[:10]
|
1
2
3
4
5
6
7
8
9
10
| [('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',),
('Wyoming',)]
|
Ordering by Multiple Columns
We can pass multiple arguments to the .order_by() method to order by multiple columns. In fact, we can also sort in ascending or descending order for each individual column. Each column in the .order_by() method is fully sorted from left to right. This means that the first column is completely sorted, and then within each matching group of values in the first column, it’s sorted by the next column in the .order_by() method. This process is repeated until all the columns in the .order_by() are sorted.
Instructions
- Select all records of the state and age columns from the census table.
- Use .order_by() to sort the output of the state column in ascending order and age in descending order. (NOTE: desc is already imported).
- Execute stmt using the .execute() method on connection and retrieve all the results using .fetchall().
- Print the first 20 results.
1
2
3
4
5
6
7
8
9
10
11
| # Build a query to select state and age: stmt
stmt = select(census.columns.state, census.columns.age)
# Append order by to ascend by state and descend by age
stmt = stmt.order_by(census.columns.state, desc(census.columns.age))
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print the first 20 results
results[:20]
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
| [('Alabama', 85),
('Alabama', 85),
('Alabama', 84),
('Alabama', 84),
('Alabama', 83),
('Alabama', 83),
('Alabama', 82),
('Alabama', 82),
('Alabama', 81),
('Alabama', 81),
('Alabama', 80),
('Alabama', 80),
('Alabama', 79),
('Alabama', 79),
('Alabama', 78),
('Alabama', 78),
('Alabama', 77),
('Alabama', 77),
('Alabama', 76),
('Alabama', 76)]
|
Counting, Summing and Grouping Data
SQL Functions
- E.g. Count, Sum
- from sqlalchemy import func
- More efficient than processing in Python
- Aggregate data
Sum Example
1
2
3
4
5
| from sqlalchemy import func
stmt = select(func.sum(census.columns.pop2008))
results = connection.execute(stmt).scalar()
print(results)
Out: 302876613
|
Group by
- Allows us to group rows by common values
1
2
3
4
5
6
| stmt = select(census.columns.sex, func.sum(census.columns.pop2008))
stmt = stmt.group_by(census.columns.sex)
results = connection.execute(stmt).fetchall()
print(results)
Out: [('F', 153959198), ('M', 148917415)]
|
- Supports multiple columns to group by with a pattern similar to order_by()
- Requires all selected columns to be grouped or aggregated by a function
Group by Multiple
1
2
3
4
5
6
7
8
9
| stmt = select(census.columns.sex, census.columns.age, func.sum(census.columns.pop2008))
stmt = stmt.group_by(census.columns.sex, census.columns.age)
results = connection.execute(stmt).fetchall()
print(results)
Out:
[('F', 0, 2105442), ('F', 1, 2087705), ('F', 2, 2037280), ('F', 3,
2012742), ('F', 4, 2014825), ('F', 5, 1991082), ('F', 6, 1977923),
('F', 7, 2005470), ('F', 8, 1925725), …
|
Handling ResultSets from Functions
- SQLAlchemy auto-generates “column names” for functions in the ResultSet
- The column names are often func_# such as count_1
- Replace them with the label() method
Using label()
1
2
3
4
5
6
7
8
print(results[0].keys())
Out: ['sex', 'sum_1']
stmt = select(census.columns.sex, func.sum(census.columns.pop2008).label('pop2008_sum'))
stmt = stmt.group_by(census.columns.sex)
results = connection.execute(stmt).fetchall()
print(results[0].keys())
Out: ['sex', 'pop2008_sum']
|
Exercises
1
2
3
4
5
6
7
8
| # Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(engine)
# Reflect the 'census' table via engine
census = Table('census', metadata, autoload_with=engine)
|
Counting Distinct Data
As mentioned in the video, SQLAlchemy’s func module provides access to built-in SQL functions that can make operations like counting and summing faster and more efficient.
In the video, Jason used func.sum() to get a sum of the pop2008 column of census as shown below:
1
| select(func.sum(census.columns.pop2008))
|
If instead you want to count the number of values in pop2008, you could use func.count() like this:
1
| select(func.count(census.columns.pop2008))
|
Furthermore, if you only want to count the distinct values of pop2008, you can use the .distinct() method:
1
| select(func.count(census.columns.pop2008.distinct()))
|
In this exercise, you will practice using func.count() and .distinct() to get a count of the distinct number of states in census.
So far, you’ve seen .fetchall() and .first() used on a ResultProxy to get the results. The ResultProxy also has a method called .scalar() for getting just the value of a query that returns only one row and column.
This can be very useful when you are querying for just a count or sum.
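As a minimal illustration of .scalar() outside the census data, here is a self-contained sketch using an in-memory SQLite database created just for this example:

```python
from sqlalchemy import create_engine, text

# Throwaway in-memory SQLite database purely for illustration
engine = create_engine('sqlite:///:memory:')

with engine.connect() as connection:
    # .scalar() unwraps a one-row, one-column result into a plain Python value
    total = connection.execute(text('SELECT 40 + 2')).scalar()

print(total)  # 42
```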
Instructions
- Build a select statement to count the distinct values in the state field of census.
- Execute stmt to get the count and store the results as distinct_state_count.
- Print the value of distinct_state_count.
1
2
3
4
5
6
7
8
| # Build a query to count the distinct states values: stmt
stmt = select(func.count(census.columns.state.distinct()))
# Execute the query and store the scalar result: distinct_state_count
distinct_state_count = connection.execute(stmt).scalar()
# Print the distinct_state_count
distinct_state_count
|
Count of Records by State
Often, we want to get a count of records grouped by a particular value in another column. The .group_by() method helps answer this type of query. You can pass a column to the .group_by() method and pair it with an aggregate function like sum() or count() in the select statement. Much like the .order_by() method, .group_by() can take multiple columns as arguments.
Instructions
- Import func from sqlalchemy.
- Build a select statement to get the value of the state field and a count of the values in the age field, and store it as stmt.
- Use the .group_by() method to group the statement by the state column.
- Execute stmt using the connection to get the count and store the results as results.
- Print the keys/column names of the results returned using results[0].keys().
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
| # Create an SQLAlchemy Engine instance. This engine manages connections to the database.
# The `census_sql_data` should be a database URL that specifies database dialect and connection arguments.
engine = create_engine(census_sql_data)
# Create a session. This session establishes and maintains all conversations with the database.
# It represents a 'holding zone' for all the objects which you've loaded or associated with it during its lifespan.
session = Session(bind=engine)
# Create a MetaData instance. MetaData is a container object that keeps together many different features of a database (or multiple databases).
metadata = MetaData()
# Reflect the database schema into MetaData. This loads table definitions from the database automatically.
# The `bind=engine` argument tells MetaData which engine to use for connection.
metadata.reflect(bind=engine)
# Access the 'census' table object from the metadata. This dictionary-style access allows you to get a Table object.
# Each Table object is a member of the MetaData collection.
census = metadata.tables['census']
# Build a SELECT statement that selects the state column and a count of the values in the age column.
stmt = select(census.columns.state, func.count(census.columns.age))
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the SELECT statement using the session. This sends the SQL statement to the database and returns a result object.
result = session.execute(stmt)
rows = result.fetchall()
# Grab the first row of the fetched results. Each row is a Row object that allows both indexed and keyed access.
first_row = rows[0]
|
1
2
3
4
5
6
7
8
9
10
11
| # Print the column names available in the result. This helps verify the structure of the returned rows and what columns can be accessed.
print(result.keys())
# Print some rows
print(rows[:5])
# Print the type of the first row to verify it's an instance of `Row`, which allows both indexed and keyed access.
print(type(first_row))
print(first_row.state)
print(first_row.count_1)
|
1
2
3
4
5
| RMKeyView(['state', 'count_1'])
[('Alabama', 172), ('Alaska', 172), ('Arizona', 172), ('Arkansas', 172), ('California', 172)]
<class 'sqlalchemy.engine.row.Row'>
Alabama
172
|
Note that result = session.execute(stmt) is not the same as result = connection.execute(stmt).fetchall():
- type(session.execute(stmt)) → sqlalchemy.engine.cursor.CursorResult
- type(connection.execute(stmt).fetchall()) → list
Determining the Population Sum by State
To avoid confusion with query result column names like count_1, we can use the .label() method to provide a name for the resulting column. This gets appended to the function we are using, and its argument is the name we want to use.
We can pair func.sum() with .group_by() to get a sum of the population by State and use the label() method to name the output.
We can also create the func.sum() expression before using it in the select statement. We do it the same way we would inside the select statement and store it in a variable. Then we use that variable in the select statement where the func.sum() would normally be.
Instructions
- Import func from sqlalchemy.
- Build an expression to calculate the sum of the values in the pop2008 field labeled as ‘population’.
- Build a select statement to get the value of the state field and the sum of the values in pop2008.
- Group the statement by state using a .group_by() method.
- Execute stmt using the connection to get the count and store the results as results.
- Print the keys/column names of the results returned using results[0].keys().
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
| # Build an expression to calculate the sum of pop2008 labeled as population
pop2008_sum = func.sum(census.columns.pop2008).label('population')
# Build a query to select the state and sum of pop2008: stmt
stmt = select(census.columns.state, pop2008_sum)
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt)
# Print the keys/column names of the results returned
print(results.keys())
rows = results.fetchall()
# Print results
rows[:5]
|
1
2
3
4
5
6
7
8
9
10
11
| RMKeyView(['state', 'population'])
[('Alabama', 4649367),
('Alaska', 664546),
('Arizona', 6480767),
('Arkansas', 2848432),
('California', 36609002)]
|
SQLAlchemy and Pandas for Visualization
SQLAlchemy and Pandas
- DataFrame can take a SQLAlchemy ResultSet
- Make sure to set the DataFrame columns to the ResultSet keys
DataFrame Example
1
2
3
4
5
6
7
8
9
10
11
12
13
| import pandas as pd
df = pd.DataFrame(results)
df.columns = results[0].keys()
print(df)
Out:
sex pop2008_sum
0 F 2105442
1 F 2087705
2 F 2037280
3 F 2012742
4 F 2014825
5 F 1991082
|
Graphing
- We can graph just like we would normally
Graphing Example
1
2
3
| import matplotlib.pyplot as plt
df[10:20].plot.barh()
plt.show()
|
Exercises
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| # Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
# Build an expression to calculate the sum of pop2008 labeled as population
pop2008_sum = func.sum(census.columns.pop2008).label('population')
# Build a query to select the state and sum of pop2008: stmt
stmt = select(census.columns.state, pop2008_sum)
stmt = stmt.order_by(desc(pop2008_sum))
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt)
rows = results.fetchall()
# Print results
rows[:5]
|
1
2
3
4
5
| [('California', 36609002),
('Texas', 24214127),
('New York', 19465159),
('Florida', 18257662),
('Illinois', 12867077)]
|
SQLAlchemy ResultsProxy and Pandas Dataframes
We can feed a ResultProxy directly into a pandas DataFrame, which is the workhorse of many Data Scientists in PythonLand. Jason demonstrated this in the video. In this exercise, you’ll follow exactly the same approach to convert a ResultProxy into a DataFrame.
Instructions
- Import pandas as pd.
- Create a DataFrame df using pd.DataFrame() on the ResultProxy results.
- Set the columns of the DataFrame df.columns to be the columns from the first result object results[0].keys().
- Print the DataFrame.
1
2
3
4
5
| # Create a DataFrame from the results: df
df = pd.DataFrame(rows, columns=results.keys())
# Print the Dataframe
df.head()
|
|   | state | population |
|---|---|---|
| 0 | California | 36609002 |
| 1 | Texas | 24214127 |
| 2 | New York | 19465159 |
| 3 | Florida | 18257662 |
| 4 | Illinois | 12867077 |
From SQLAlchemy results to a Graph
We can also take advantage of pandas and Matplotlib to build figures of our data. Remember that data visualization is essential for both exploratory data analysis and communication of your data!
Instructions
- Import matplotlib.pyplot as plt.
- Create a DataFrame df using pd.DataFrame() on the provided results.
- Set the columns of the DataFrame df.columns to be the columns from the first result object results[0].keys().
- Print the DataFrame df.
- Use the plot.bar() method on df to create a bar plot of the results.
- Display the plot with plt.show().
1
2
3
4
| df = pd.DataFrame(rows, columns=results.keys())
# Print the DataFrame
df.head()
|
|   | state | population |
|---|---|---|
| 0 | California | 36609002 |
| 1 | Texas | 24214127 |
| 2 | New York | 19465159 |
| 3 | Florida | 18257662 |
| 4 | Illinois | 12867077 |
1
2
| # Plot the DataFrame
ax = df.plot(x='state', kind='barh', figsize=(7, 10))
|
Advanced SQLAlchemy Queries
Herein, you will learn to perform advanced - and incredibly useful - queries that will enable you to interact with your data in powerful ways.
Calculating Values in a Query
Math Operators
- addition +
- subtraction -
- multiplication *
- division /
- modulus %
- Work differently on different data types
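For example, + adds Integer columns but concatenates String columns. A minimal sketch with a small throwaway in-memory table (invented for illustration; not part of the census database):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, insert, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
people = Table('people', metadata,
               Column('fname', String), Column('lname', String),
               Column('age', Integer))
metadata.create_all(engine)

with engine.connect() as connection:
    connection.execute(insert(people),
                       [{'fname': 'Ada', 'lname': 'Lovelace', 'age': 36}])
    # "+" concatenates the String columns but adds 1 to the Integer column
    stmt = select(people.columns.fname + ' ' + people.columns.lname,
                  people.columns.age + 1)
    combined = connection.execute(stmt).fetchall()

print(combined)  # [('Ada Lovelace', 37)]
```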
Calculating Difference
1
2
3
4
5
6
7
8
| stmt = select(census.columns.age, (census.columns.pop2008 - census.columns.pop2000).label('pop_change'))
stmt = stmt.group_by(census.columns.age)
stmt = stmt.order_by(desc('pop_change'))
stmt = stmt.limit(5)
results = connection.execute(stmt).fetchall()
print(results)
Out: [(61, 52672), (85, 51901), (54, 50808), (58, 45575), (60, 44915)]
|
Case Statement
- Used to treat data differently based on a condition
- Accepts one or more (condition, value) tuples: a condition to match and a column to return if the condition matches (in SQLAlchemy 2.0 these are passed positionally, not as a list)
- The list of conditions ends with an else clause to determine what to do when a record doesn’t match any prior conditions
Case Example
1
2
3
4
5
6
| from sqlalchemy import case
stmt = select(func.sum(case((census.columns.state == 'New York', census.columns.pop2008), else_=0)))
results = connection.execute(stmt).fetchall()
print(results)
Out:[(19465159,)]
|
Cast Statement
- Converts data to another type
- Useful for converting
- integers to floats for division
- strings to dates and times
- Accepts a column or expression and the target Type
Percentage Example
1
2
3
4
5
6
7
| from sqlalchemy import case, cast, Float
stmt = select((func.sum(case((census.columns.state == 'New York', census.columns.pop2008), else_=0)) /
cast(func.sum(census.columns.pop2008), Float) * 100).label('ny_percent'))
results = connection.execute(stmt).fetchall()
print(results)
Out: [(Decimal('6.4267619765'),)]
|
Examples
Connecting to a MySQL Database
Before you jump into the calculation exercises, let’s begin by connecting to our database. Recall that in the last chapter you connected to a PostgreSQL database. Now, you’ll connect to a MySQL database, for which many prefer to use the pymysql database driver, which, like psycopg2 for PostgreSQL, you have to install prior to use.
This connection string is going to start with ‘mysql+pymysql://’, indicating which dialect and driver you’re using to establish the connection. The dialect block is followed by the ‘username:password’ combo. Next, you specify the host and port with the following ‘@host:port/’. Finally, you wrap up the connection string with the ‘database_name’.
Now you’ll practice connecting to a MySQL database: it will be the same census database that you have already been working with. One of the great things about SQLAlchemy is that, after connecting, it abstracts over the type of database it has connected to and you can write the same SQLAlchemy code, regardless!
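As a side note, the same connection string can be assembled with SQLAlchemy's URL.create() helper instead of concatenating strings by hand. A sketch built from the pieces named above:

```python
from sqlalchemy.engine import URL

# Assemble the dialect+driver, credentials, host, port and database name
url = URL.create(
    drivername='mysql+pymysql',
    username='student',
    password='datacamp',
    host='courses.csrrinzqubik.us-east-1.rds.amazonaws.com',
    port=3306,
    database='census',
)

print(str(url))
# mysql+pymysql://student:datacamp@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/census
```

The resulting URL object can be passed straight to create_engine() in place of the string.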
Instructions
- Import the create_engine function from the sqlalchemy library.
- Create an engine to the census database by concatenating the following strings and passing them to create_engine():
- ‘mysql+pymysql://’ (the dialect and driver).
- ‘student:datacamp’ (the username and password).
- ‘@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/’ (the host and port).
- ‘census’ (the database name).
- Use the .table_names() method on engine to print the table names. (Note: Engine.table_names() was removed in SQLAlchemy 2.0; the code below uses inspect(engine).get_table_names() instead.)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| # Use with local file
engine2 = create_engine(census_sql_data)
# Create an inspector object from the engine
inspector = inspect(engine2)
# Use the inspector to list the table names
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine2.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
print('\nCensus:')
census
|
1
2
3
4
5
6
7
8
9
10
| Engine Table Names:
['census', 'state_fact']
Census:
Table('census', MetaData(), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)
|
1
2
3
4
5
6
7
8
| # Use for remote connection
# Create an engine to the census database
engine = create_engine('mysql+pymysql://student:datacamp@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/census')
inspector = inspect(engine)
table_names = inspector.get_table_names()
# Print the table names
print('Engine Table Names: \n', table_names)
|
1
2
| Engine Table Names:
['census', 'state_fact']
|
1
2
3
4
5
6
| # Copy and run as code to use with remote engine
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload_with=engine)
print('\nCensus:')
census
|
Calculating a Difference between Two Columns
Often, you’ll need to perform math operations as part of a query, such as if you wanted to calculate the change in population from 2000 to 2008. For math operations on numbers, the operators in SQLAlchemy work the same way as they do in Python.
You can use these operators to perform addition (+), subtraction (-), multiplication (*), division (/), and modulus (%) operations. Note: They behave differently when used with non-numeric column types.
Let’s now find the top 5 states by population growth between 2000 and 2008.
Instructions
- Define a select statement called stmt to return:
- i) The state column of the census table (census.columns.state).
- ii) The difference in population count between 2008 (census.columns.pop2008) and 2000 (census.columns.pop2000) labeled as ‘pop_change’.
- Group the statement by census.columns.state.
- Order the statement by population change (‘pop_change’) in descending order. Do so by passing it desc(‘pop_change’).
- Use the .limit() method on the statement to return only 5 records.
- Execute the statement and fetchall() the records.
- The print statement has already been written for you. Hit ‘Submit Answer’ to view the results!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| # Build query to return state names by population difference from 2008 to 2000: stmt
stmt = select(census.columns.state, (census.columns.pop2008 - census.columns.pop2000).label('pop_change'))
# Append group by for the state: stmt
stmt = stmt.group_by(census.columns.state)
# Append order by for pop_change descendingly: stmt
stmt = stmt.order_by(desc('pop_change'))
# Return only 5 results: stmt
stmt = stmt.limit(5)
# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()
# Print the state and population change for each record
for result in results:
print(f'{result.state}:{result.pop_change}')
|
1
2
3
4
5
| Texas:40137
California:35406
Florida:21954
Arizona:14377
Georgia:13357
|
Determining the Overall Percentage of Females
It’s possible to combine functions and operators in a single select statement as well. These combinations can be exceptionally handy when we want to calculate percentages or averages, and we can also use the case() expression to operate on data that meets specific criteria while not affecting the query as a whole. The case() expression accepts one or more (condition, value) tuples, each pairing a condition to match with the column to return if it matches, followed by an else_ value if none of the conditions match. We can wrap this entire expression in any function or math operation we like.
Often when performing integer division, we want to get a float back. While some databases will do this automatically, you can use the cast() function to convert an expression to a particular type.
Instructions
- Import case, cast, and Float from sqlalchemy.
- Build an expression female_pop2000 to calculate female population in 2000. To achieve this:
- Use case() inside func.sum().
- The first argument of case() is a tuple of
- i) A boolean expression checking that census.columns.sex is equal to ‘F’.
- ii) The column census.columns.pop2000.
- The else_ keyword argument should be set to 0. (In SQLAlchemy 2.0, case() takes the (condition, value) tuples positionally rather than in a list.)
- Calculate the total population in 2000 and use cast() to convert it to Float.
- Build a query to calculate the percentage of females in 2000. To do this, divide female_pop2000 by total_pop2000 and multiply by 100.
- Execute the query and print percent_female.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| # import case, cast and Float from sqlalchemy - at top of notebook
# Build an expression to calculate female population in 2000
female_pop2000 = func.sum(case((census.columns.sex == 'F', census.columns.pop2000), else_=0))
# Cast an expression to calculate total population in 2000 to Float
total_pop2000 = cast(func.sum(census.columns.pop2000), Float)
# Build a query to calculate the percentage of females in 2000: stmt
stmt = select(female_pop2000 / total_pop2000 * 100)
# Execute the query and store the scalar result: percent_female
percent_female = connection.execute(stmt).scalar()
percent_female
|
SQL Relationships
Relationships
- Allow us to avoid duplicate data
- Make it easy to change things in one place
- Useful to break out information from a table we don’t need very often
Relationships
Automatic Joins
1
2
3
4
5
6
7
8
9
10
11
12
| stmt = select(census.columns.pop2008, state_fact.columns.abbreviation)
results = connection.execute(stmt).fetchall()
print(results)
Out:
[(95012, 'IL'),
(95012, 'NJ'),
(95012, 'ND'),
(95012, 'OR'),
(95012, 'DC'),
(95012, 'WI'),
…
|
Join
- Accepts a Table and an optional expression that explains how the two tables are related
- The expression is not needed if the relationship is predefined and available via reflection
- Comes immediately after the select() clause and prior to any where(), order_by or group_by() clauses
Select_from
- Used to replace the default, derived FROM clause with a join
- Wraps the join() clause
Select_from Example
1
2
3
4
5
6
| stmt = select(func.sum(census.columns.pop2000))
stmt = stmt.select_from(census.join(state_fact))
stmt = stmt.where(state_fact.columns.circuit_court == '10')
result = connection.execute(stmt).scalar()
print(result)
Out: 14945252
|
Joining Tables without Predefined Relationship
- Join accepts a Table and an optional expression that explains how the two tables are related
- Will only join on data that match between the two columns
- Avoid joining on columns of different types
Select_from Example
1
2
3
4
5
6
7
| stmt = select(func.sum(census.columns.pop2000))
stmt = stmt.select_from(census.join(state_fact, census.columns.state == state_fact.columns.name))
stmt = stmt.where(state_fact.columns.census_division_name == 'East South Central')
result = connection.execute(stmt).scalar()
print(result)
Out: 16982311
|
Examples
Automatic Joins with an Established Relationship
If you have two tables that already have an established relationship, you can automatically use that relationship by just adding the columns we want from each table to the select statement. Recall that Jason constructed the following query:
1
| stmt = select(census.columns.pop2008, state_fact.columns.abbreviation)
|
in order to join the census and state_fact tables and select the pop2008 column from the first and the abbreviation column from the second. In this case, the census and state_fact tables had a pre-defined relationship: the state column of the former corresponded to the name column of the latter.
In this exercise, you’ll use the same predefined relationship to select the pop2000 and abbreviation columns!
Instructions
- Build a statement to join the census and state_fact tables and select the pop2000 column from the first and the abbreviation column from the second.
- Execute the statement to get the first result and save it as result.
- Hit ‘Submit Answer’ to loop over the keys of the result object, and print the key and value for each!
1
2
state_fact = Table('state_fact', metadata, autoload_with=engine2)
state_fact
|
1
| Table('state_fact', MetaData(), Column('id', VARCHAR(length=256), table=<state_fact>), Column('name', VARCHAR(length=256), table=<state_fact>), Column('abbreviation', VARCHAR(length=256), table=<state_fact>), Column('country', VARCHAR(length=256), table=<state_fact>), Column('type', VARCHAR(length=256), table=<state_fact>), Column('sort', VARCHAR(length=256), table=<state_fact>), Column('status', VARCHAR(length=256), table=<state_fact>), Column('occupied', VARCHAR(length=256), table=<state_fact>), Column('notes', VARCHAR(length=256), table=<state_fact>), Column('fips_state', VARCHAR(length=256), table=<state_fact>), Column('assoc_press', VARCHAR(length=256), table=<state_fact>), Column('standard_federal_region', VARCHAR(length=256), table=<state_fact>), Column('census_region', VARCHAR(length=256), table=<state_fact>), Column('census_region_name', VARCHAR(length=256), table=<state_fact>), Column('census_division', VARCHAR(length=256), table=<state_fact>), Column('census_division_name', VARCHAR(length=256), table=<state_fact>), Column('circuit_court', VARCHAR(length=256), table=<state_fact>), schema=None)
|
1
2
3
4
5
6
7
8
9
10
11
12
| # Build a statement to join census and state_fact tables: stmt
stmt = select(census.columns.pop2000, state_fact.columns.abbreviation)
# Execute the statement and get the first result: result
result = connection.execute(stmt)
# extract the rows
row = result.fetchone()
# Loop over the keys in the result object and print the key and value
for key in result.keys():
print(key, getattr(row, key))
|
1
2
3
4
5
6
| pop2000 89600
abbreviation IL
C:\Users\trenton\AppData\Local\Temp\ipykernel_54724\3953660652.py:5: SAWarning: SELECT statement has a cartesian product between FROM element(s) "census" and FROM element "state_fact". Apply join condition(s) between each element to resolve.
result = connection.execute(stmt)
|
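The SAWarning above disappears once an explicit join condition is supplied. Below is a minimal, self-contained sketch of the same query against a throwaway in-memory SQLite database (the rows are invented for illustration, not taken from the course data):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, insert, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)),
               Column('pop2000', Integer()))
state_fact = Table('state_fact', metadata,
                   Column('name', String(30)),
                   Column('abbreviation', String(2)))
metadata.create_all(engine)

with engine.connect() as connection:
    connection.execute(insert(census), [{'state': 'Illinois', 'pop2000': 89600}])
    connection.execute(insert(state_fact), [{'name': 'Illinois', 'abbreviation': 'IL'}])

    # The explicit join condition resolves the cartesian product warning
    stmt = select(census.columns.pop2000, state_fact.columns.abbreviation)
    stmt = stmt.select_from(
        census.join(state_fact,
                    census.columns.state == state_fact.columns.name))
    row = connection.execute(stmt).fetchone()
    print(row)  # (89600, 'IL')
```

Joining on census.state == state_fact.name tells the database exactly how the two FROM elements relate, so no cartesian product is produced.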
Joins
If you aren’t selecting columns from both tables or the two tables don’t have a defined relationship, you can still use the .join() method on a table to join it with another table and get extra data related to your query. The .join() method takes the table object you want to join in as the first argument and a condition that indicates how the tables are related as the second argument. Finally, you use the .select_from() method on the select statement to wrap the join clause. For example, in the video, Jason executed the following code to join the census table to the state_fact table such that the state column of the census table corresponded to the name column of the state_fact table.
1
2
3
4
| stmt = stmt.select_from(
census.join(
state_fact, census.columns.state ==
    state_fact.columns.name))
|
Instructions
- Build a statement to select ALL the columns from the census and state_fact tables. To select ALL the columns from two tables employees and sales, for example, you would use stmt = select(employees, sales).
- Append a select_from to stmt to join the census table to the state_fact table by the state column in census and the name column in the state_fact table.
- Execute the statement to get the first result and save it as result. This code is already written.
- Hit ‘Submit Answer’ to loop over the keys of the result object, and print the key and value for each!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| # Build a statement to select the census and state_fact tables: stmt
stmt = select(census, state_fact)
# Add a select_from clause that wraps a join for the census and state_fact
# tables where the census state column and state_fact name column match
stmt = stmt.select_from(census.join(state_fact, census.columns.state == state_fact.columns.name))
# Execute the statement and get the first result: result
result = connection.execute(stmt)
# extract the rows
row = result.fetchone()
# Loop over the keys in the result object and print the key and value
for key in result.keys():
print(key, getattr(row, key))
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
| state Illinois
sex M
age 0
pop2000 89600
pop2008 95012
id 13
name Illinois
abbreviation IL
country USA
type state
sort 10
status current
occupied occupied
notes
fips_state 17
assoc_press Ill.
standard_federal_region V
census_region 2
census_region_name Midwest
census_division 3
census_division_name East North Central
circuit_court 7
|
More Practice with Joins
You can use the same select statement you built in the last exercise; however, let’s add a twist: only return a few columns, and use the other table in a group_by() clause.
Instructions
- Build a statement to select:
- The state column from the census table.
- The sum of the pop2008 column from the census table.
- The census_division_name column from the state_fact table.
- Append a .select_from() to stmt in order to join the census and state_fact tables by the state and name columns.
- Group the statement by the name column of the state_fact table.
- Execute the statement to get all the records and save it as results.
- Hit ‘Submit Answer’ to loop over the results object and print each record.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| # Build a statement to select the state, sum of 2008 population and census
# division name: stmt
stmt = select(census.columns.state,
func.sum(census.columns.pop2008),
state_fact.columns.census_division_name)
# Append select_from to join the census and state_fact tables by the census state and state_fact name columns
stmt = stmt.select_from(census.join(state_fact, census.columns.state == state_fact.columns.name))
# Append a group by for the state_fact name column
stmt = stmt.group_by(state_fact.columns.name)
# Execute the statement and get the results: results
results = connection.execute(stmt).fetchall()
# Loop over the the results object and print each record.
for record in results:
print(record)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
| ('Alabama', 4649367, 'East South Central')
('Alaska', 664546, 'Pacific')
('Arizona', 6480767, 'Mountain')
('Arkansas', 2848432, 'West South Central')
('California', 36609002, 'Pacific')
('Colorado', 4912947, 'Mountain')
('Connecticut', 3493783, 'New England')
('Delaware', 869221, 'South Atlantic')
('Florida', 18257662, 'South Atlantic')
('Georgia', 9622508, 'South Atlantic')
('Hawaii', 1250676, 'Pacific')
('Idaho', 1518914, 'Mountain')
('Illinois', 12867077, 'East North Central')
('Indiana', 6373299, 'East North Central')
('Iowa', 3000490, 'West North Central')
('Kansas', 2782245, 'West North Central')
('Kentucky', 4254964, 'East South Central')
('Louisiana', 4395797, 'West South Central')
('Maine', 1312972, 'New England')
('Maryland', 5604174, 'South Atlantic')
('Massachusetts', 6492024, 'New England')
('Michigan', 9998854, 'East North Central')
('Minnesota', 5215815, 'West North Central')
('Mississippi', 2922355, 'East South Central')
('Missouri', 5891974, 'West North Central')
('Montana', 963802, 'Mountain')
('Nebraska', 1776757, 'West North Central')
('Nevada', 2579387, 'Mountain')
('New Hampshire', 1314533, 'New England')
('New Jersey', 8670204, 'Mid-Atlantic')
('New Mexico', 1974993, 'Mountain')
('New York', 19465159, 'Mid-Atlantic')
('North Carolina', 9121606, 'South Atlantic')
('North Dakota', 634282, 'West North Central')
('Ohio', 11476782, 'East North Central')
('Oklahoma', 3620620, 'West South Central')
('Oregon', 3786824, 'Pacific')
('Pennsylvania', 12440129, 'Mid-Atlantic')
('Rhode Island', 1046535, 'New England')
('South Carolina', 4438870, 'South Atlantic')
('South Dakota', 800997, 'West North Central')
('Tennessee', 6202407, 'East South Central')
('Texas', 24214127, 'West South Central')
('Utah', 2730919, 'Mountain')
('Vermont', 620602, 'New England')
('Virginia', 7648902, 'South Atlantic')
('Washington', 6502019, 'Pacific')
('West Virginia', 1812879, 'South Atlantic')
('Wisconsin', 5625013, 'East North Central')
('Wyoming', 529490, 'Mountain')
|
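The join-plus-aggregate pattern above can be reproduced end to end on a toy in-memory database (the numbers below are made up for illustration, not the census data):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, insert, select, func)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)),
               Column('pop2008', Integer()))
state_fact = Table('state_fact', metadata,
                   Column('name', String(30)),
                   Column('census_division_name', String(30)))
metadata.create_all(engine)

with engine.connect() as connection:
    connection.execute(insert(census), [
        {'state': 'Illinois', 'pop2008': 95012},
        {'state': 'Illinois', 'pop2008': 94988}])
    connection.execute(insert(state_fact), [
        {'name': 'Illinois', 'census_division_name': 'East North Central'}])

    # Select a column, an aggregate, and a column from the joined table
    stmt = select(census.columns.state,
                  func.sum(census.columns.pop2008),
                  state_fact.columns.census_division_name)
    stmt = stmt.select_from(
        census.join(state_fact,
                    census.columns.state == state_fact.columns.name))
    stmt = stmt.group_by(state_fact.columns.name)
    results = connection.execute(stmt).fetchall()
    print(results)  # [('Illinois', 190000, 'East North Central')]
```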
Working with Hierarchical Tables
Hierarchical Tables
- Contain a relationship with themselves
- Commonly found in:
- Organizational
- Geographic
- Network
- Graph
Hierarchical Tables - Example
Hierarchical Tables - alias()
- Requires a way to view the table via multiple names
- Creates a unique reference that we can use
Querying Hierarchical Data
1
2
3
4
5
6
7
8
9
| managers = employees.alias()
stmt = select(managers.columns.name.label('manager'), employees.columns.name.label('employee'))
stmt = stmt.select_from(employees.join(managers, managers.columns.id == employees.columns.manager))
stmt = stmt.order_by(managers.columns.name)
print(connection.execute(stmt).fetchall())
Out: [(u'FILLMORE', u'GRANT'),
(u'FILLMORE', u'ADAMS'),
(u'HARDING', u'TAFT'), ...
|
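A runnable sketch of the alias pattern above, using an in-memory SQLite table with a few invented employees (the id/manager column names mirror the slide):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, insert, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(20)),
                  Column('manager', Integer()))  # id of this row's manager
metadata.create_all(engine)

with engine.connect() as connection:
    connection.execute(insert(employees), [
        {'id': 1, 'name': 'FILLMORE', 'manager': None},
        {'id': 2, 'name': 'GRANT', 'manager': 1},
        {'id': 3, 'name': 'ADAMS', 'manager': 1}])

    # The alias gives the same table a second name for the self-join
    managers = employees.alias()
    stmt = select(managers.columns.name.label('manager'),
                  employees.columns.name.label('employee'))
    stmt = stmt.select_from(
        employees.join(managers,
                       managers.columns.id == employees.columns.manager))
    stmt = stmt.order_by(managers.columns.name)
    results = connection.execute(stmt).fetchall()
    print(results)
```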
Group_by and Func
- It’s important to target group_by() at the right alias
- Be careful with what you perform functions on
- If you don’t find yourself using both the alias and the table name for a query, don’t create the alias at all
Querying Hierarchical Data
1
2
3
4
5
6
7
8
9
10
| managers = employees.alias()
stmt = select(managers.columns.name, func.sum(employees.columns.sal))
stmt = stmt.select_from(employees.join(managers, managers.columns.id == employees.columns.manager))
stmt = stmt.group_by(managers.columns.name)
print(connection.execute(stmt).fetchall())
Out: [(u'FILLMORE', Decimal('96000.00')),
(u'GARFIELD', Decimal('83500.00')),
(u'HARDING', Decimal('52000.00')),
(u'JACKSON', Decimal('197000.00'))]
|
Examples
Employees Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| # Use with local file
engine = create_engine(employees_sql_data)
# Create an inspector object from the engine
inspector = inspect(engine)
# Use the inspector to list the table names
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
# Reflect the database schema into MetaData. This loads table definitions from the database automatically.
# The `bind=engine` argument tells MetaData which engine to use for connection.
metadata.reflect(bind=engine)
employees = metadata.tables['employees']
|
1
2
| Engine Table Names:
['employees']
|
Using alias to handle same table joined queries
Often, you’ll have tables that contain hierarchical data, such as employees and managers who are also employees. For this reason, you may wish to join a table to itself on different columns. The .alias() method, which creates a second, uniquely named reference to the same table, helps accomplish this task. Because it’s the same table, you only need a where clause to specify the join condition.
Here, you’ll use the .alias() method to build a query to join the employees table against itself to determine to whom everyone reports.
Instructions
- Save an alias of the employees table as managers. To do so, apply the method .alias() to employees.
- Build a query to select the employee name and their manager’s name. The manager’s name has already been selected for you. Use label to label the name column of employees as ‘employee’.
- Append a where clause to stmt to match where the id column of the managers table corresponds to the mgr column of the employees table.
- Order the statement by the name column of the managers table.
- Execute the statement and store all the results. This code is already written. Hit ‘Submit Answer’ to print the names of the managers and all their employees.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| # Make an alias of the employees table: managers
managers = employees.alias()
# Build a query to select manager's and their employees names: stmt
stmt = select(managers.columns.name.label('manager'), employees.columns.name.label('employee'))
# Match managers id with employees mgr: stmt
stmt = stmt.where(managers.columns.id == employees.columns.mgr)
# Order the statement by the managers name: stmt
stmt = stmt.order_by(managers.columns.name)
# Execute statement: results
results = connection.execute(stmt).fetchall()
# Print records
for record in results:
print(record)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
| ('FILLMORE', 'GRANT')
('FILLMORE', 'ADAMS')
('FILLMORE', 'MONROE')
('GARFIELD', 'JOHNSON')
('GARFIELD', 'LINCOLN')
('GARFIELD', 'POLK')
('GARFIELD', 'WASHINGTON')
('HARDING', 'TAFT')
('HARDING', 'HOOVER')
('JACKSON', 'HARDING')
('JACKSON', 'GARFIELD')
('JACKSON', 'FILLMORE')
('JACKSON', 'ROOSEVELT')
|
Leveraging Functions and Group_bys with Hierarchical Data
It’s also common to want to roll up data which is in a hierarchical table. Rolling up data requires being careful about which alias you use for the group_by() and which table you apply the function to.
Here, your job is to get a count of employees for each manager.
Instructions
- Save an alias of the employees table as managers.
- Build a query to select the name column of the managers table and the count of the number of their employees. The function func.count() has been imported and will be useful! Use it to count the id column of the employees table.
- Using a .where() clause, filter the records where the id column of the managers table and mgr column of the employees table are equal.
- Group the query by the name column of the managers table.
- Execute the statement and store all the results. Print the names of the managers and their employees. This code has already been written so hit ‘Submit Answer’ and check out the results!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| # Make an alias of the employees table: managers
managers = employees.alias()
# Build a query to select managers and counts of their employees: stmt
stmt = select(managers.columns.name, func.count(employees.columns.id))
# Append a where clause that ensures the manager id and employee mgr are equal
stmt = stmt.where(managers.columns.id == employees.columns.mgr)
# Group by Managers Name
stmt = stmt.group_by(managers.columns.name)
# Execute statement: results
results = connection.execute(stmt).fetchall()
# print manager
for record in results:
print(record)
|
1
2
3
4
| ('FILLMORE', 3)
('GARFIELD', 4)
('HARDING', 2)
('JACKSON', 4)
|
Handling Large Result Sets
Dealing with Large ResultSets
- fetchmany() lets us specify how many rows we want to act upon
- We can loop over fetchmany()
- It returns an empty list when there are no more records
- We have to close the ResultProxy afterwards
Fetching Many Rows
1
2
3
4
5
6
7
| while more_results:
partial_results = results_proxy.fetchmany(50)
if partial_results == []:
more_results = False
for row in partial_results:
state_count[row.state] += 1
results_proxy.close()
|
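Here is a self-contained version of the loop above, using a small in-memory table rather than the census data and a batch size shrunk to 2 (both choices are just for illustration):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, insert, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
census = Table('census', metadata, Column('state', String(30)))
metadata.create_all(engine)

connection = engine.connect()
connection.execute(insert(census),
                   [{'state': s} for s in ['Illinois', 'Illinois',
                                           'Texas', 'Texas', 'Texas']])

results_proxy = connection.execute(select(census))
state_count = {}
more_results = True
while more_results:
    partial_results = results_proxy.fetchmany(2)  # 2 rows per batch
    if partial_results == []:  # empty list means no rows are left
        more_results = False
    for row in partial_results:
        state_count[row.state] = state_count.get(row.state, 0) + 1
results_proxy.close()
connection.close()
print(state_count)  # {'Illinois': 2, 'Texas': 3}
```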
Examples
Census Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| # Import select - at top of Notebook
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
# Build select statement for census table: stmt
stmt = select(census)
# Print the emitted statement to see the SQL emitted
print(stmt)
|
1
2
| SELECT census.state, census.sex, census.age, census.pop2000, census.pop2008
FROM census
|
1
2
| # Execute the statement and print the results
results_proxy = connection.execute(stmt)
|
Working on Blocks of Records
Fantastic work so far! As Jason discussed in the video, sometimes you may need to work with a large ResultProxy when you don’t have the memory to load all the results at once. To work around that issue, you can get blocks of rows from the ResultProxy by using the .fetchmany() method inside a loop, passing it the number of records you want per block. When it returns an empty list, there are no more rows left to fetch, and you have processed all the results of the query. Then you need to call the .close() method to close out the ResultProxy and release its connection to the database.
You’ll now have the chance to practice this on a large ResultProxy called results_proxy that has been pre-loaded for you to work with.
Instructions
- Use a while loop that checks if there are more_results.
- Inside the loop, apply the method .fetchmany() to results_proxy to get 50 records at a time and store those records as partial_results.
- After fetching the records, if partial_results is an empty list (that is, if it is equal to []), set more_results to False.
- Loop over the partial_results and, if row.state is a key in the state_count dictionary, increment state_count[row.state] by 1; otherwise set state_count[row.state] to 1.
- After the while loop, close the ResultProxy results_proxy using .close().
- Hit ‘Submit Answer’ to print state_count.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| # set more_results & state_count
more_results = True
state_count = dict()
# Start a while loop checking for more results
while more_results:
# Fetch the first 50 results from the ResultProxy: partial_results
partial_results = results_proxy.fetchmany(50)
# if empty list, set more_results to False
if partial_results == []:
more_results = False
# Loop over the fetched records and increment the count for the state
for row in partial_results:
if row.state in state_count:
state_count[row.state] += 1
else:
state_count[row.state] = 1
# Close the ResultProxy, and thus the connection
results_proxy.close()
# Print the count by state
pp(state_count)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
| {'Alabama': 172,
'Alaska': 172,
'Arizona': 172,
'Arkansas': 172,
'California': 172,
'Colorado': 172,
'Connecticut': 172,
'Delaware': 172,
'District of Columbia': 172,
'Florida': 172,
'Georgia': 172,
'Hawaii': 172,
'Idaho': 172,
'Illinois': 172,
'Indiana': 172,
'Iowa': 172,
'Kansas': 172,
'Kentucky': 172,
'Louisiana': 172,
'Maine': 172,
'Maryland': 172,
'Massachusetts': 172,
'Michigan': 172,
'Minnesota': 172,
'Mississippi': 172,
'Missouri': 172,
'Montana': 172,
'Nebraska': 172,
'Nevada': 172,
'New Hampshire': 172,
'New Jersey': 172,
'New Mexico': 172,
'New York': 172,
'North Carolina': 172,
'North Dakota': 172,
'Ohio': 172,
'Oklahoma': 172,
'Oregon': 172,
'Pennsylvania': 172,
'Rhode Island': 172,
'South Carolina': 172,
'South Dakota': 172,
'Tennessee': 172,
'Texas': 172,
'Utah': 172,
'Vermont': 172,
'Virginia': 172,
'Washington': 172,
'West Virginia': 172,
'Wisconsin': 172,
'Wyoming': 172}
|
Creating and Manipulating your own Databases
In the previous chapters, you interacted with existing databases and queried them in various different ways. Now, you will learn how to build your own databases and keep them updated!
Creating Databases and Tables
Creating Databases
- Varies by the database type
- Databases like PostgreSQL and MySQL have command line tools to initialize the database
- With SQLite, the create_engine() statement will create the database and file if they do not already exist
Building a Table
1
2
3
4
5
6
7
8
9
from sqlalchemy import (Table, Column, String, Integer, Numeric, Boolean)
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255)),
                  Column('salary', Numeric()),
                  Column('active', Boolean()))
metadata.create_all(engine)
inspect(engine).get_table_names()
['employees']
|
Creating Tables
- Still uses the Table object like we did for reflection
- Replaces the autoload keyword arguments with Column objects
- Creates the tables in the actual database by using the create_all() method on the MetaData instance
- You need to use other tools to handle database table updates, such as Alembic or raw SQL
Creating Tables - Additional Column Options
- unique forces all values for the data in a column to be unique
- nullable determines if a column can be empty in a row
- default sets a default value if one isn’t supplied.
Building a Table with Additional Options
1
2
3
4
5
6
7
8
9
10
11
| employees = Table('employees', metadata,
Column('id', Integer()),
Column('name', String(255), unique=True, nullable=False),
Column('salary', Float(), default=100.00),
Column('active', Boolean(), default=True))
employees.constraints
Out: {CheckConstraint(Column('name', String(length=255), table=<employees>, nullable=False),
Column('salary', Float(), table=<employees>, default=ColumnDefault(100.0)),
Column('active', Boolean(), table=<employees>, default=ColumnDefault(True)) ...
UniqueConstraint(Column('name', String(length=255), table=<employees>, nullable=False))}
|
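To see the defaults in action, the sketch below creates the table in an in-memory SQLite database and inserts a row that supplies only id and name; salary and active fall back to their column defaults (the values are invented for illustration):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Integer, Float, Boolean, insert, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255), unique=True, nullable=False),
                  Column('salary', Float(), default=100.00),
                  Column('active', Boolean(), default=True))
metadata.create_all(engine)

with engine.connect() as connection:
    # Only id and name are supplied; salary and active use their defaults
    connection.execute(insert(employees).values(id=1, name='Jason'))
    row = connection.execute(select(employees)).first()
    print(row)  # (1, 'Jason', 100.0, True)
```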
Examples
Creating Tables with SQLAlchemy
Previously, you used the Table object to reflect a table from an existing database, but what if you wanted to create a new table? You’d still use the Table object; however, you’d need to replace the autoload and autoload_with parameters with Column objects.
The Column object takes a name, a SQLAlchemy type with an optional format, and optional keyword arguments for different constraints.
When defining the table, recall how in the video Jason passed in 255 as the maximum length of a String by using Column(‘name’, String(255)). Checking out the slides from the video may help: you can download them by clicking on ‘Slides’ next to the IPython Shell.
After defining the table, you can create the table in the database by using the .create_all() method on metadata and supplying the engine as the only parameter. Go for it!
Instructions
- Import Table, Column, String, Integer, Float, Boolean from sqlalchemy.
- Build a new table called data with columns ‘name’ (String(255)), ‘count’ (Integer()), ‘amount’(Float()), and ‘valid’ (Boolean()) columns. The second argument of Table() needs to be metadata, which has already been initialized.
- Create the table in the database by passing engine to metadata.create_all().
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
| metadata = MetaData()
# Define a new table with a name, count, amount, and valid column: data
test_data = Table('test_data', metadata,
Column('name', String(255)),
Column('count', Integer()),
Column('amount', Float()),
Column('valid', Boolean()))
engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print table details
print(repr(test_data))
connection.close()
|
1
| Table('test_data', MetaData(), Column('name', String(length=255), table=<test_data>), Column('count', Integer(), table=<test_data>), Column('amount', Float(), table=<test_data>), Column('valid', Boolean(), table=<test_data>), schema=None)
|
Constraints and Data Defaults
You’re now going to practice creating a table with some constraints! Often, you’ll need to make sure that a column is unique, nullable, a positive value, or related to a column in another table. This is where constraints come in.
As Jason showed you in the video, in addition to constraints, you can also set a default value for the column if no data is passed to it via the default keyword on the column.
Instructions
- Table, Column, String, Integer, Float, Boolean are already imported from sqlalchemy.
- Build a new table called data with a unique name (String), count (Integer) defaulted to 1, amount (Float), and valid (Boolean) defaulted to False.
- Hit ‘Submit Answer’ to create the table in the database and to print the table details for data.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
| metadata = MetaData()
# Define a new table with a name, count, amount, and valid column: data
test_data = Table('test_data', metadata,
Column('name', String(255), unique=True),
Column('count', Integer(), default=1),
Column('amount', Float()),
Column('valid', Boolean(), default=False),
extend_existing=True)
engine = create_engine('sqlite:///:memory:')
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print the table details
print(repr(metadata.tables['test_data']))
connection.close()
|
1
| Table('test_data', MetaData(), Column('name', String(length=255), table=<test_data>), Column('count', Integer(), table=<test_data>, default=ScalarElementColumnDefault(1)), Column('amount', Float(), table=<test_data>), Column('valid', Boolean(), table=<test_data>, default=ScalarElementColumnDefault(False)), schema=None)
|
Inserting Data into a Table
Adding Data to a Table
- Done with the insert() statement
- Insert() takes the table we are loading data into as the argument
- We add all the values we want to insert in with the values clause as column=value pairs
- Doesn’t return any rows, so no need for a fetch method
Inserting One Row
1
2
3
4
5
| from sqlalchemy import insert
stmt = insert(employees).values(id=1, name='Jason', salary=1.00, active=True)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)
Out: 1
|
Inserting Multiple Rows
- Build an insert statement without any values
- Build a list of dictionaries that represent all the values clauses for the rows you want to insert
- Pass both the stmt and the values list to the execute method on connection
1
2
3
4
5
6
| stmt = insert(employees)
values_list = [{'id': 2, 'name': 'Rebecca', 'salary': 2.00, 'active': True},
{'id': 3, 'name': 'Bob', 'salary': 0.00, 'active': False}]
result_proxy = connection.execute(stmt, values_list)
print(result_proxy.rowcount)
Out: 2
|
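The two slide snippets above can be combined into one runnable sketch against an in-memory SQLite database (names and salaries invented for illustration):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Float, Boolean, insert)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255)),
                  Column('salary', Float()),
                  Column('active', Boolean()))
metadata.create_all(engine)

with engine.connect() as connection:
    # Single-row insert: values supplied via .values()
    stmt = insert(employees).values(id=1, name='Jason', salary=1.00, active=True)
    single = connection.execute(stmt).rowcount
    # Multi-row insert: a bare insert() paired with a list of dictionaries
    values_list = [{'id': 2, 'name': 'Rebecca', 'salary': 2.00, 'active': True},
                   {'id': 3, 'name': 'Bob', 'salary': 0.00, 'active': False}]
    multi = connection.execute(insert(employees), values_list).rowcount
    print(single, multi)  # 1 2
```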
Examples
Inserting a single row with an insert() statement
There are several ways to perform an insert with SQLAlchemy; however, we are going to focus on the one that follows the same pattern as the select statement.
It uses an insert statement where you specify the table as an argument, and supply the data you wish to insert via the .values() method as keyword arguments.
Here, the name of the table is data.
Instructions
- Import insert and select from the sqlalchemy module.
- Build an insert statement for the data table to set name to ‘Anna’, count to 1, amount to 1000.00, and valid to True. Save the statement as stmt.
- Execute stmt with the connection and store the results.
- Print the rowcount attribute of results to see how many records were inserted.
- Build a select statement to query for the record with the name of ‘Anna’.
- Hit ‘Submit Answer’ to print the results of executing the select statement.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| # Use with local file
engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
test_data = metadata.tables['test_data']
print('\nTest Data:')
test_data
|
1
2
3
4
5
6
7
8
9
10
| Engine Table Names:
['test_data']
Test Data:
Table('test_data', MetaData(), Column('name', VARCHAR(length=255), table=<test_data>), Column('count', INTEGER(), table=<test_data>), Column('amount', FLOAT(), table=<test_data>), Column('valid', BOOLEAN(), table=<test_data>), schema=None)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| # Build an insert statement to insert a record into the data table: stmt
stmt = insert(test_data).values(name='Anna', count=1, amount=1000.00, valid=True)
# Execute the statement via the connection: results
results = connection.execute(stmt)
# Print result rowcount
print(results.rowcount)
# Build a select statement to validate the insert
stmt = select(test_data).where(test_data.columns.name == 'Anna')
# Print the result of executing the query.
print(connection.execute(stmt).first())
connection.close()
|
1
2
| 1
('Anna', 1, 1000.0, True)
|
Inserting Multiple Records at Once
It’s time to practice inserting multiple records at once!
As Jason showed you in the video, you’ll want to first build a list of dictionaries that represents the data you want to insert. Then, in the .execute() method, you can pair this list of dictionaries with an insert statement, which will insert all the records in your list of dictionaries.
Instructions
- Build a list of dictionaries called values_list with two dictionaries. In the first dictionary set name to ‘Anna’, count to 1, amount to 1000.00, and valid to True. In the second dictionary of the list, set name to ‘Taylor’, count to 1, amount to 750.00, and valid to False.
- Build an insert statement for the data table for a multiple insert, save it as stmt.
- Execute stmt with the values_list via connection and store the results. Make sure values_list is the second argument to .execute().
- Print the rowcount of the results.
1
2
3
4
5
6
7
8
9
10
11
12
13
| engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
test_data = metadata.tables['test_data']
print('\nTest Data:')
test_data
|
1
2
3
4
5
6
7
8
9
10
| Engine Table Names:
['test_data']
Test Data:
Table('test_data', MetaData(), Column('name', VARCHAR(length=255), table=<test_data>), Column('count', INTEGER(), table=<test_data>), Column('amount', FLOAT(), table=<test_data>), Column('valid', BOOLEAN(), table=<test_data>), schema=None)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
| # Build a list of dictionaries: values_list
values_list = [{'name': 'Alexandria', 'count': 1, 'amount': 1000.00, 'valid': True},
{'name': 'Taylor', 'count': 1, 'amount': 750.00, 'valid': False}]
# Build an insert statement for the data table: stmt
stmt = insert(test_data)
# Execute stmt with the values_list: results
results = connection.execute(stmt, values_list)
# Print rowcount
print(results.rowcount)
connection.close()
|
Loading a CSV into a Table
You’ve done a great job so far at inserting data into tables! You’re now going to learn how to load the contents of a CSV file into a table.
We have used the csv module to set up a csv_reader, which is just a reader object that can iterate over the lines in a given CSV file - in this case, a census CSV file. Using the enumerate() function, you can loop over the csv_reader to handle the results one at a time. Here, for example, the first line it would return is:
1
| 0 ['Illinois', 'M', '0', '89600', '95012']
|
0 is the idx - or line number - while [‘Illinois’, ‘M’, ‘0’, ‘89600’, ‘95012’] is the row, corresponding to the column names ‘state’, ‘sex’, ‘age’, ‘pop2000’ and ‘pop2008’. ‘Illinois’ can be accessed with row[0], ‘M’ with row[1], and so on. You can create a dictionary containing this information where the keys are the column names and the values are the entries in each line. Then, by appending this dictionary to a list, you can combine it with an insert statement to load it all into a table!
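The row-to-dictionary step described above can be sketched on its own, without touching the file (the sample row is the one quoted in the text):

```python
# Column names for the census CSV, in file order
column_names = ['state', 'sex', 'age', 'pop2000', 'pop2008']

# One parsed CSV row, exactly as csv.reader would yield it (all strings)
row = ['Illinois', 'M', '0', '89600', '95012']

# zip pairs each column name with the matching entry in the row
data = dict(zip(column_names, row))
print(data)
# {'state': 'Illinois', 'sex': 'M', 'age': '0', 'pop2000': '89600', 'pop2008': '95012'}
```

Appending one such dictionary per CSV line to a list produces exactly the values_list the insert statement expects.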
Instructions
- Create a statement for bulk insert into the census table. To do this, just use insert() and census.
- Create an empty list called values_list and a variable called total_rowcount that is set to 0.
- Within the for loop:
- Complete the data dictionary by filling in the values for each of the keys. The values are contained in row. row[0] represents the value for ‘state’, row[1] represents the value for ‘sex’, and so on.
- Append data to values_list.
- If 51 cleanly divides into the current idx:
- Execute stmt with the values_list and save it as results.
- Hit ‘Submit Answer’ to print total_rowcount when done with all the records.
1
2
3
4
5
6
| csv_reader = list()
with open(census_csv_data, newline='\n') as csvfile:
file_reader = csv.reader(csvfile)
for row in file_reader:
csv_reader.append(row)
|
1
2
3
4
5
6
7
8
9
10
11
| engine = create_engine(census_sql_data)
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
print('\nTest Data:')
census
|
1
2
3
4
5
6
7
8
9
10
| Engine Table Names:
['census', 'state_fact']
Test Data:
Table('census', MetaData(), Column('state', VARCHAR(length=30), table=<census>), Column('sex', VARCHAR(length=1), table=<census>), Column('age', INTEGER(), table=<census>), Column('pop2000', INTEGER(), table=<census>), Column('pop2008', INTEGER(), table=<census>), schema=None)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
| # Create a insert statement for census: stmt
stmt = insert(census)
# Create an empty list and zeroed row count: values_list, total_rowcount
values_list = []
total_rowcount = 0
# Enumerate the rows of csv_reader
for idx, row in enumerate(csv_reader):
#create data and append to values_list
data = {'state': row[0], 'sex': row[1], 'age': row[2], 'pop2000': row[3], 'pop2008': row[4]}
values_list.append(data)
    # Check to see if divisible by 51
    if idx % 51 == 0:
        results = connection.execute(stmt, values_list)
        total_rowcount += results.rowcount
        values_list = []
# Insert any rows remaining in the final partial batch
if values_list:
    results = connection.execute(stmt, values_list)
    total_rowcount += results.rowcount
# Print total rowcount
print(total_rowcount)
connection.close()
|
Updating Data in a Database
Updating Data in a Table
- Done with the update statement
- Similar to the insert statement but includes a where clause to determine what record will be updated
- We add all the values we want to update with the values clause as column=value pairs
Updating One Row
1
2
3
4
5
6
7
| from sqlalchemy import update
stmt = update(employees)
stmt = stmt.where(employees.columns.id == 3)
stmt = stmt.values(active=True)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)
Out: 1
|
Updating Multiple Rows
- Build a where clause that will select all the records you want to update
1
2
3
4
5
6
| stmt = update(employees)
stmt = stmt.where(employees.columns.active == True)
stmt = stmt.values(active=False, salary=0.00)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)
Out: 3
|
Correlated Updates
| new_salary = select(employees.columns.salary)
new_salary = new_salary.order_by(desc(employees.columns.salary))
new_salary = new_salary.limit(1)
stmt = update(employees)
stmt = stmt.values(salary=new_salary)
result_proxy = connection.execute(stmt)
print(result_proxy.rowcount)
Out: 3
|
- Uses a select() statement to find the value for the column we are updating
- Commonly used to update records to a maximum value or change a string to match an abbreviation from another table
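A minimal runnable sketch of a correlated update in SQLAlchemy 2.0 style, using a throwaway in-memory SQLite database (the codes and target tables here are made up for illustration):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, String,
                        insert, select, update)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
codes = Table('codes', metadata,
              Column('code', String(2)),
              Column('name', String(30)))
target = Table('target', metadata,
               Column('code', String(2)),
               Column('name', String(30)))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(codes), [{'code': '36', 'name': 'New York'},
                                 {'code': '06', 'name': 'California'}])
    conn.execute(insert(target), [{'code': '36'}, {'code': '06'}])

    # For each target row, look up the matching name in codes
    name_subq = (select(codes.c.name)
                 .where(codes.c.code == target.c.code)
                 .scalar_subquery())
    conn.execute(update(target).values(name=name_subq))

    rows = conn.execute(select(target.c.code, target.c.name)).fetchall()

print(rows)
```

The subquery correlates automatically on `target.c.code`, so each row gets its own looked-up value.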
Examples - Original
Following is the original content for the class, which has not been updated to run sqlalchemy 2.0.29
.
Updating individual records
The update statement is very similar to an insert statement, except that it also typically uses a where clause to help us determine what data to update. You’ll be working with the FIPS state code here, which is used by the U.S. government to identify U.S. states and certain other associated areas. Recall that you can update all wages in the employees table as follows:
| stmt = update(employees).values(wage=100.00)
|
For your convenience, the names of the tables and columns of interest in this exercise are: state_fact (Table), name (Column), and fips_state (Column).
Instructions
- Build a statement to select all columns from the state_fact table where the name column is New York. Call it select_stmt.
- Print the results of executing the select_stmt and fetching all records.
- Build an update statement to change the fips_state column code to 36, save it as stmt.
- Use a where clause to filter for states with the name of ‘New York’ in the state_fact table.
- Execute stmt via the connection and save the output as results.
- Hit ‘Submit Answer’ to print the rowcount of the results and the results of executing select_stmt. This will verify the fips_state code is now 36.
| engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
print('\nTest Data:')
state_fact
|
| # Build a select statement: select_stmt
select_stmt = select([state_fact]).where(state_fact.columns.name == 'New York')
# Print the results of executing the select_stmt
print(connection.execute(select_stmt).fetchall())
# Build a statement to update the fips_state to 36: stmt
stmt = update(state_fact).values(fips_state = 36)
# Append a where clause to limit it to records for New York state
stmt = stmt.where(state_fact.columns.name == 'New York')
# Execute the statement: results
results = connection.execute(stmt)
# Print rowcount
print(results.rowcount)
# Execute the select_stmt again to view the changes
print(connection.execute(select_stmt).fetchall())
connection.close()
|
Updating Multiple Records
As Jason discussed in the video, by using a where clause that selects more records, you can update multiple records at once. It’s time now to practice this!
For your convenience, the names of the tables and columns of interest in this exercise are: state_fact (Table), notes (Column), and census_region_name (Column).
Instructions
- Build an update statement to update the notes column in the state_fact table to ‘The Wild West’. Save it as stmt.
- Use a where clause to filter for records that have ‘West’ in the census_region_name column of the state_fact table.
- Execute stmt via the connection and save the output as results.
- Hit ‘Submit Answer’ to print rowcount of the results.
| engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
print('\nTest Data:')
state_fact
|
| # Build a statement to update the notes to 'The Wild West': stmt
stmt = update(state_fact).values(notes = 'The Wild West')
# Append a where clause to match the West census region records
stmt = stmt.where(state_fact.columns.census_region_name == 'West')
# Execute the statement: results
results = connection.execute(stmt)
# Print rowcount
print(results.rowcount)
connection.close()
|
Correlated Updates
You can also update records with data from a select statement. This is called a correlated update. It works by defining a select statement that returns the value you want to update the record with and assigning that as the value in an update statement.
You’ll be using a flat_census in this exercise as the target of your correlated update. The flat_census table is a summarized copy of your census table.
Instructions
- Build a statement to select the name column from state_fact. Save the statement as fips_stmt.
- Append a where clause to fips_stmt that matches fips_state from the state_fact table with fips_code in the flat_census table.
- Build an update statement to set the state_name in flat_census to fips_stmt. Save the statement as update_stmt.
- Hit ‘Submit Answer’ to execute update_stmt, store the results and print the rowcount of results.
Create flat_census table
| metadata = MetaData()
# Define a new table with state_name and fips_code columns: flat_census
flat_census = Table('flat_census', metadata,
Column('state_name', String(255)),
Column('fips_code', String(255)),
extend_existing=True)
engine = create_engine(census_sql_data)
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print the table details
print(repr(metadata.tables['flat_census']))
|
Copy fips_state values from the state_fact table
| state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
results = connection.execute(select([state_fact.columns.fips_state])).fetchall()
print(results[:5])
print(len(results))
|
Create a df with results
| df = pd.DataFrame(results, columns=['fips_code'])
df.head()
|
Write the df to fips_code in the flat_census table
| df.to_sql('flat_census', con=engine, index=False, if_exists='append')
|
Complete the example
| # Build a statement to select name from state_fact: stmt
fips_stmt = select([state_fact.columns.name])
# Append a where clause to Match the fips_state to flat_census fips_code
fips_stmt = fips_stmt.where(state_fact.columns.fips_state == flat_census.columns.fips_code)
# Build an update statement to set the name to fips_stmt: update_stmt
update_stmt = update(flat_census).values(state_name = fips_stmt)
# Execute update_stmt: results
results = connection.execute(update_stmt)
# Print rowcount
print(results.rowcount)
connection.close()
|
Examples - Updated
The previous examples have been updated to function with sqlalchemy 2.0.29
.
Example Notes
Database and Table Setup:
- Create a database connection using create_engine with specific connection arguments to handle multithreading and timeout settings.
- Define a flat_census table within a metadata object, specifying columns for state names and FIPS codes. Design this table to either be created or have its existing schema extended.
Reflection and Table Access:
- Use metadata.reflect() to load existing table definitions from the database into the SQLAlchemy context. This enables interaction with previously defined tables such as state_fact.
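Reflection can be sketched against a throwaway in-memory SQLite database (the state_demo table here is hypothetical):

```python
from sqlalchemy import create_engine, MetaData, text

engine = create_engine('sqlite:///:memory:')
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE state_demo (id INTEGER, name VARCHAR(30))'))

metadata = MetaData()
metadata.reflect(bind=engine)                 # load all existing tables
state_demo = metadata.tables['state_demo']    # access a reflected table
print(state_demo.columns.keys())
```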
Basic Data Operations:
- Perform a basic data retrieval operation using a select statement filtered by the state name (‘New York’). Execute the query within a session and fetch results.
- Update data by changing the FIPS state code for New York to 36. This operation illustrates the transactional nature of the session, which requires a commit to save changes.
Advanced Data Operations - Correlated Update:
- Construct a correlated subquery to update the flat_census table. The subquery fetches the state name from state_fact where the FIPS codes match between state_fact and flat_census.
- Execute the update operation to align the state_name in flat_census with the results of the subquery. Committing the session is required to apply the changes.
Commitment and Closure:
- Commit the session after executing database operations to ensure all changes are saved. Then close the session to release database resources.
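The commit-and-close pattern can be sketched with Session as a context manager, which closes the session automatically on exit (the table t below is a placeholder):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine('sqlite:///:memory:')

with Session(engine) as session:
    session.execute(text('CREATE TABLE t (x INTEGER)'))
    session.execute(text('INSERT INTO t (x) VALUES (1)'))
    session.commit()   # persist changes before the session closes

# The session is closed here; verify with a fresh connection
with engine.connect() as conn:
    rows = conn.execute(text('SELECT x FROM t')).fetchall()
print(rows)
```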
Summary:
- The instructions cover creating and configuring a database engine, defining and reflecting tables, and performing both simple and complex database operations.
Code Synopsis
- Set up the database connection using create_engine with specific parameters for multithreading and timeout.
- Define and create the flat_census table if it does not already exist, with columns for state_name and fips_code, and ensure it can extend an existing schema.
- Reflect all existing tables from the database to ensure they are loaded into SQLAlchemy’s metadata.
- Access the state_fact table from the reflected metadata.
- Open a session and perform several operations within it:
  - Retrieve and print records for New York from the state_fact table.
  - Update the FIPS state code for New York to 36, commit the session, and print the number of rows updated.
  - Print updated records for New York to confirm the update.
  - Update the notes field for records where the census region is ‘West’ to ‘The Wild West’, commit the session, and print the number of rows updated.
  - Retrieve and display the first five FIPS states from the state_fact table.
  - Export FIPS states to the flat_census table using a DataFrame and the to_sql method.
  - Create and execute a correlated subquery to update the state_name in the flat_census table based on matching FIPS codes between state_fact and flat_census.
  - Commit the session after updates and print the number of rows affected by the correlated update.
- Close the session to release resources.
Full Script
| # Database connection string and engine setup
engine = create_engine(census_sql_data, connect_args={'check_same_thread': False, 'timeout': 15})
metadata = MetaData()
# Define the 'flat_census' table if it doesn't exist
flat_census = Table('flat_census', metadata,
Column('state_name', String(255)),
Column('fips_code', String(255)),
extend_existing=True)
# Create or extend the table in the database
metadata.create_all(engine)
# Reflect the existing tables to ensure all are loaded
metadata.reflect(bind=engine)
# Accessing the tables
state_fact = metadata.tables['state_fact']
with Session(engine) as session:
# Operations on state_fact
select_stmt = select(state_fact).where(state_fact.c.name == 'New York')
new_york_records = session.execute(select_stmt).fetchall()
print('New York Records:', new_york_records)
update_stmt = update(state_fact).where(state_fact.c.name == 'New York').values(fips_state=36)
result = session.execute(update_stmt)
session.commit()
print('Rows updated:', result.rowcount)
updated_records = session.execute(select_stmt).fetchall()
print('Updated New York Records:', updated_records)
update_stmt_west = update(state_fact).where(state_fact.c.census_region_name == 'West').values(notes='The Wild West')
result_west = session.execute(update_stmt_west)
session.commit()
print('Rows updated in West:', result_west.rowcount)
fips_results = session.execute(select(state_fact.c.fips_state)).fetchall()
print('FIPS States:', fips_results[:5])
# Exporting to flat_census using DataFrame
df = pd.DataFrame(fips_results, columns=['fips_code'])
display(df.head())
df.to_sql('flat_census', con=engine, index=False, if_exists='append', method='multi')
# Build a statement to select name from state_fact where fips_state matches flat_census fips_code
fips_stmt = select(state_fact.c.name).\
where(state_fact.c.fips_state == flat_census.c.fips_code).\
scalar_subquery()
# Build an update statement to set the state_name in flat_census to the name selected from state_fact
update_stmt = update(flat_census).values(state_name=fips_stmt)
# Execute update_stmt using the session
result = session.execute(update_stmt)
session.commit()
# Print rowcount
print(result.rowcount)
|
| New York Records: [('32', 'New York', 'NY', 'USA', 'state', '10', 'current', 'occupied', '', '36', 'N.Y.', 'II', '1', 'Northeast', '2', 'Mid-Atlantic', '2')]
Rows updated: 1
Updated New York Records: [('32', 'New York', 'NY', 'USA', 'state', '10', 'current', 'occupied', '', '36', 'N.Y.', 'II', '1', 'Northeast', '2', 'Mid-Atlantic', '2')]
Rows updated in West: 13
FIPS States: [('17',), ('34',), ('38',), ('41',), ('11',)]
|
Removing Data From a Database
Deleting Data from a Table
- Done with the delete() statement
- delete() takes the table we are loading data into as the argument
- A where() clause is used to choose which rows to delete
- Hard to undo so BE CAREFUL!!!
Deleting all Data from a Table
| from sqlalchemy import delete
stmt = select(func.count(extra_employees.columns.id))
connection.execute(stmt).scalar()
Out: 3
delete_stmt = delete(extra_employees)
result_proxy = connection.execute(delete_stmt)
result_proxy.rowcount
Out: 3
|
Deleting Specific Rows
- Build a where clause that will select all the records you want to delete
| stmt = delete(employees).where(employees.columns.id == 3)
result_proxy = connection.execute(stmt)
result_proxy.rowcount
Out: 1
|
Dropping a Table Completely
- Uses the drop method on the table
- Accepts the engine as an argument so it knows where to remove the table from
- Dropping the table in the database does not remove the Table object from the MetaData; remove it with metadata.remove() or rebuild the MetaData
| extra_employees.drop(engine)
print(inspect(engine).has_table('extra_employees'))
Out: False
|
Dropping all the Tables
- Uses the drop_all() method on MetaData
| metadata.drop_all(engine)
inspect(engine).get_table_names()
Out: []
|
Examples
Deleting all the records from a table
Often, you’ll need to empty a table of all of its records so you can reload the data. You can do this with a delete statement with just the table as an argument. For example, in the video, Jason emptied the extra_employees table by executing as follows:
| delete_stmt = delete(extra_employees)
result_proxy = connection.execute(delete_stmt)
|
Do be careful, though, as deleting cannot be undone!
Instructions
- Import delete and select from sqlalchemy.
- Build a delete statement to remove all the data from the census table. Save it as stmt.
- Execute stmt via the connection and save the results.
- Hit ‘Submit Answer’ to select all remaining rows from the census table and print the result to confirm that the table is now empty!
Connect to the database and create a new table (census2)
| engine = create_engine(census_sql_data)
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
results = connection.execute(select(census)).fetchall()
df = pd.DataFrame(results)
df.columns = ['state', 'sex', 'age', 'pop2000', 'pop2008']
_dtypes = {'state': String(30), 'sex': String(1), 'age': Integer(), 'pop2000': Integer(), 'pop2008': Integer()}
df.to_sql('census2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
|
| Engine Table Names:
['census', 'flat_census', 'state_fact']
8772
|
Complete the example
| census2 = Table('census2', metadata, autoload_with=engine)
# Build a statement to empty the census table: stmt
stmt = delete(census2)
# Execute the statement: results
results = connection.execute(stmt)
# Print affected rowcount
print(results.rowcount)
# Build a statement to select all records from the census table
stmt = select(census2)
# Print the results of executing the statement to verify there are no rows
print(connection.execute(stmt).fetchall())
connection.close()
|
Deleting specific records
By using a where() clause, you can target the delete statement to remove only certain records. For example, Jason deleted all rows from the employees table that had id 3 with the following delete statement:
| delete(employees).where(employees.columns.id == 3)
|
Here you’ll delete ALL rows which have ‘M’ in the sex column and 36 in the age column. We have included code at the start which computes the total number of these rows. It is important to make sure that this is the number of rows that you actually delete.
Instructions
- Build a delete statement to remove data from the census table. Save it as stmt_del.
- Append a where clause to stmt_del that contains an and_ to filter for rows which have ‘M’ in the sex column AND 36 in the age column.
- Execute the delete statement.
- Hit ‘Submit Answer’ to print the rowcount of the results, as well as to_delete, which returns the number of rows that should be deleted. These should match and this is an important sanity check!
Connect to the database and create a new table (census2)
| engine = create_engine(census_sql_data)
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
census = metadata.tables['census']
results = connection.execute(select(census)).fetchall()
df = pd.DataFrame(results)
df.columns = ['state', 'sex', 'age', 'pop2000', 'pop2008']
_dtypes = {'state': String(30), 'sex': String(1), 'age': Integer(), 'pop2000': Integer(), 'pop2008': Integer()}
df.to_sql('census2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
|
| Engine Table Names:
['census', 'census2', 'flat_census', 'state_fact']
8772
|
Complete the example
| census2 = Table('census2', metadata, autoload_with=engine)
# Build a statement to count records using the sex column for Men ('M') age 36: stmt
stmt = select(func.count(census2.columns.sex)).where(and_(census2.columns.sex == 'M', census2.columns.age == 36))
# Execute the select statement and use the scalar() fetch method to save the record count
to_delete = connection.execute(stmt).scalar()
|
| # Build a statement to delete records from the census table: stmt_del
stmt_del = delete(census2)
# Append a where clause to target Men ('M') age 36
stmt_del = stmt_del.where(and_(census2.columns.sex == 'M', census2.columns.age == 36))
# Execute the statement: results
results = connection.execute(stmt_del)
# Print affected rowcount and to_delete record count, make sure they match
print(results.rowcount, to_delete)
connection.close()
|
Deleting a Table Completely
You’re now going to practice dropping individual tables from a database with the .drop() method, as well as all tables in a database with the .drop_all() method!
As Spider-Man’s Uncle Ben (as well as Jason, in the video!) said: With great power, comes great responsibility. Do be careful when deleting tables, as it’s not simple or fast to restore large databases! Remember, you can check to see if a table exists with the .exists() method.
This is the final exercise in this chapter: After this, you’ll be ready to apply everything you’ve learned to a case study in the final chapter of this course!
Instructions
- Drop the state_fact table by applying the method .drop() to it and passing it the argument engine (in fact, engine will be the sole argument for every function/method in this exercise!)
- Check to see if state_fact exists via print. Use the .exists() method with engine as the argument.
- Drop all the tables via the metadata using the .drop_all() method.
- Use a print statement to check if the census table exists.
Connect to the database and create a new table (state_fact2)
| engine = create_engine(census_sql_data)
inspector = inspect(engine)
table_names = inspector.get_table_names()
print('Engine Table Names: \n', table_names)
|
| Engine Table Names:
['census', 'census2', 'flat_census', 'state_fact']
|
| connection = engine.connect()
metadata = MetaData()
metadata.reflect(bind=engine)
state_fact = metadata.tables['state_fact']
results = connection.execute(select(state_fact)).fetchall()
df = pd.DataFrame(results)
_dtypes = {'id': String(256),
'name': String(256),
'abbreviation': String(256),
'country': String(256),
'type': String(256),
'sort': String(256),
'status': String(256),
'occupied': String(256),
'notes': String(256),
'fips_state': String(256),
'assoc_press': String(256),
'standard_federal_region': String(256),
'census_region': String(256),
'census_region_name': String(256),
'census_division': String(256),
'census_division_name': String(256),
'circuit_court': String(256)}
df.columns = _dtypes.keys()
df.to_sql('state_fact2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
|
Complete the example
| state_fact2 = Table('state_fact2', metadata, autoload_with=engine)
|
| # Drop the state_fact table
state_fact2.drop(engine)
|
| # Check to see if state_fact2 still exists
# state_fact2.exists(engine) # .exists() is no longer available
# Check if the table name appears in the database
table_name = 'state_fact2'
if table_name in inspector.get_table_names():
print(f"The table '{table_name}' exists.")
else:
print(f"The table '{table_name}' does not exist.")
|
| The table 'state_fact2' does not exist.
|
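As an alternative existence check, SQLAlchemy 1.4+ inspectors also provide has_table(); a minimal sketch against a scratch in-memory database (the demo table is made up):

```python
from sqlalchemy import create_engine, inspect, MetaData, Table, Column, Integer

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
demo = Table('demo', metadata, Column('id', Integer()))
metadata.create_all(engine)

print(inspect(engine).has_table('demo'))   # the table was just created
demo.drop(engine)
print(inspect(engine).has_table('demo'))   # the table has been dropped
```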
- Make a backup copy of your census db
- Recreate state_fact2
| # Drop all tables
metadata.drop_all(engine)
# Check to see if state_fact2 exists
table_name = 'state_fact2'
if table_name in inspector.get_table_names():
print(f"The table '{table_name}' exists.")
else:
print(f"The table '{table_name}' does not exist.")
|
| The table 'state_fact2' does not exist.
|
Putting it all together
Here, you will bring together all of the skills you acquired in the previous chapters to work on a real life project! From connecting to a database, to populating it, to reading and querying it, you will have a chance to apply all the key concepts you learned in this course.
Census Case Study
Census Case Study
- Preparing SQLAlchemy and the Database
- Loading Data into the Database
- Solving Data Science Problems with Queries
Part 1: Preparing SQLAlchemy and the Database
- Create an Engine and MetaData object
| from sqlalchemy import create_engine, MetaData
engine = create_engine('sqlite:///census_nyc.sqlite')
metadata = MetaData()
|
- Define a table and create it in the database (the course slide demonstrates with an employees table)
| from sqlalchemy import (Table, Column, String, Integer, Numeric, Boolean)
employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255)),
                  Column('salary', Numeric()),
                  Column('active', Boolean()))
metadata.create_all(engine)
|
Case Study Explorations
In this exercise, your job is to create an engine to the database that will be used in this chapter. Then, you need to initialize its metadata.
Recall how you did this in Chapter 1 by leveraging create_engine() and MetaData.
Instructions
- Import create_engine and MetaData from sqlalchemy.
- Create an engine to the chapter 5 database by using ‘sqlite:///chapter5.sqlite’ as the connection string.
- Create a MetaData object as metadata.
| # Import create_engine, MetaData
# from sqlalchemy import create_engine, MetaData -> done at top of notebook
# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///chapter5.sqlite')
# Initialize MetaData: metadata
metadata = MetaData()
|
Create the Table to the Database
Having setup the engine and initialized the metadata, you will now define the census table object and then create it in the database using the metadata and engine from the previous exercise. To create it in the database, you will have to use the .create_all() method on the metadata with engine as the argument.
It may help to refer back to the Chapter 4 exercise in which you learned how to create a table.
Instructions
- Import Table, Column, String, and Integer from sqlalchemy.
- Define a census table with the following columns:
- ‘state’ - String - length of 30
- ‘sex’ - String - length of 1
- ‘age’ - Integer
- ‘pop2000’ - Integer
- ‘pop2008’ - Integer
- Create the table in the database using the metadata and engine.
| # Import Table, Column, String, and Integer
# from sqlalchemy import Table, Column, String, Integer -> done at top of notebook
# Build a census table: census
census = Table('census', metadata,
Column('state', String(30)),
Column('sex', String(1)),
Column('age', Integer()),
Column('pop2000', Integer()),
Column('pop2008', Integer()))
# Create the table in the database
metadata.create_all(engine)
|
Populating the Database
Part 2: Populating the Database
- Load a CSV file into a values list
| values_list = []
for row in csv_reader:
data = {'state': row[0], 'sex': row[1],
'age': row[2], 'pop2000': row[3],
'pop2008': row[4]}
values_list.append(data)
|
- Insert the values list into the census table
| from sqlalchemy import insert
stmt = insert(employees)
result_proxy = connection.execute(stmt, values_list)
print(result_proxy.rowcount)
Out: 2
|
Case Study Explorations
Reading the Data from the CSV
Leverage the Python CSV module from the standard library and load the data into a list of dictionaries.
It may help to refer back to the Chapter 4 exercise in which you did something similar.
Instructions
- Create an empty list called values_list.
- Iterate over the rows of csv_reader with a for loop, creating a dictionary called data for each row and append it to values_list.
- Within the for loop, row will be a list whose entries are ‘state’, ‘sex’, ‘age’, ‘pop2000’ and ‘pop2008’ (in that order).
| # Create an empty list: values_list
values_list = list()
with open(census_csv_data, newline='\n') as csvfile:
csv_reader = csv.reader(csvfile)
# Iterate over the rows
for row in csv_reader:
# Create a dictionary with the values
data = {'state': row[0],
'sex': row[1],
'age':row[2],
'pop2000': row[3],
'pop2008': row[4]}
# Append the dictionary to the values list
values_list.append(data)
|
Load Data from a list into the Table
Using the multiple insert pattern, in this exercise, you will load the data from values_list into the table.
Instructions
- Import insert from sqlalchemy.
- Build an insert statement for the census table.
- Execute the statement stmt along with values_list. You will need to pass them both as arguments to connection.execute().
- Print the rowcount attribute of results.
| connection = engine.connect()
|
| # Import insert
# from sqlalchemy import insert -> done at top of notebook
# Build insert statement: stmt
stmt = insert(census)
# Use values_list to insert data: results
connection = engine.connect()
results = connection.execute(stmt, values_list)
# Print rowcount
results.rowcount
|
Example Queries
Part 3: Answering Data Science Questions with Queries
- Determine Average Age for Males and Females
| from sqlalchemy import select
stmt = select(census.columns.sex,
(func.sum(census.columns.pop2008 * census.columns.age) /
func.sum(census.columns.pop2008)).label('average_age'))
stmt = stmt.group_by(census.columns.sex)
results = connection.execute(stmt).fetchall()
|
- Determine the percentage of Females for each state
| from sqlalchemy import case, cast, Float
stmt = select((func.sum(case((census.columns.state == 'New York',
                              census.columns.pop2008), else_=0)) /
               cast(func.sum(census.columns.pop2008), Float) * 100).label('ny_percent'))
|
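A self-contained sketch of the same case()/cast() pattern against a tiny made-up table (people, with sex and pop columns), showing the percentage arithmetic end to end:

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, Integer, String,
                        insert, select, func, case, cast, Float)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
people = Table('people', metadata,
               Column('sex', String(1)),
               Column('pop', Integer()))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(people), [{'sex': 'F', 'pop': 60}, {'sex': 'M', 'pop': 40}])
    # Sum only the female population, divide by the total cast to Float
    stmt = select((func.sum(case((people.c.sex == 'F', people.c.pop), else_=0)) /
                   cast(func.sum(people.c.pop), Float) * 100).label('percent_female'))
    percent_female = conn.execute(stmt).scalar()

print(percent_female)  # 60 of 100 people -> 60.0
```

The cast to Float is what prevents integer division from truncating the ratio to zero.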
- Determine the top 5 states by population change from 2000 to 2008
| stmt = select(census.columns.state,
              (func.sum(census.columns.pop2008) - func.sum(census.columns.pop2000)).label('pop_change'))
stmt = stmt.group_by(census.columns.state)
stmt = stmt.order_by(desc('pop_change'))
stmt = stmt.limit(5)
|
Case Study Explorations
Building a Query to Determine the Average Age by Population
In this exercise, you will use the func.sum() and group_by() methods to first determine the average age weighted by the population in 2008, and then group by sex.
As Jason discussed in the video, a weighted average is calculated as the sum of the products of the weights and values divided by the sum of all the weights.
For example, the following statement determines the average age weighted by the population in 2000:
| stmt = select(census.columns.sex,
(func.sum(census.columns.pop2000 * census.columns.age) /
func.sum(census.columns.pop2000)).label('average_age')
)
|
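The formula can be checked with plain Python on toy numbers (made up for illustration):

```python
# Toy check of the weighted-average formula: sum(weight * value) / sum(weight)
ages = [20, 40]      # values being averaged
pops = [100, 300]    # weights (population counts)

weighted_avg = sum(p * a for p, a in zip(pops, ages)) / sum(pops)
print(weighted_avg)  # (20*100 + 40*300) / 400 = 35.0
```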
Instructions
- Import select from sqlalchemy.
- Build a statement to:
- Select sex from the census table.
- Select the average age weighted by the population in 2008 (pop2008). See the example given in the assignment text to see how you can do this. Label this average age calculation as ‘average_age’.
- Group the query by sex.
- Execute the query and store it as results.
- Loop over results and print the sex and average_age for each record.
| # Import select
# from sqlalchemy import select -> done at top of notebook
# Calculate weighted average age: stmt
stmt = select(census.columns.sex,
(func.sum(census.columns.pop2008 * census.columns.age) /
func.sum(census.columns.pop2008)).label('average_age'))
# Group by sex
stmt = stmt.group_by('sex')
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
|
| [('F', Decimal('38.0855587985')), ('M', Decimal('35.7836038586'))]
|
| # Print the average age by sex
for result in results:
print(result[0], result[1])
|
| F 38.0855587985
M 35.7836038586
|
Build a Query to Determine the Percentage of Population by Gender and State
In this exercise, you will write a query to determine the percentage of the population in 2000 that was made up of women. You will group this query by state.
Instructions
- Import case, cast and Float from sqlalchemy.
- Define a statement to select state and the percentage of females in 2000.
- Inside func.sum(), use case() to select females (using the sex column) from pop2000. Remember to specify else_=0 if the sex is not ‘F’.
- To get the percentage, divide the number of females in the year 2000 by the overall population in 2000. Cast the divisor - census.columns.pop2000 - to Float before multiplying by 100.
- Group the query by state.
- Execute the query and store it as results.
- Print state and percent_female for each record. This has been done for you, so hit ‘Submit Answer’ to see the result.
| # import case, cast and Float from sqlalchemy
# from sqlalchemy import case, cast, Float -> done at top of notebook
# Build a query to calculate the percentage of females in 2000: stmt
stmt = select(census.columns.state, (func.sum(case((census.columns.sex == 'F', census.columns.pop2000), else_=0)) /
cast(func.sum(census.columns.pop2000), Float) * 100).label('percent_female'))
# Group By state
stmt = stmt.group_by('state')
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
|
| # Print the percentage
for result in results:
print(f'State: {result.state}, Percent Female: {result.percent_female}')
|
| State: Alabama, Percent Female: 51.832407770179465
State: Alaska, Percent Female: 49.301497893484594
State: Arizona, Percent Female: 50.22361303057914
State: Arkansas, Percent Female: 51.26992846221834
State: California, Percent Female: 50.35233214901979
State: Colorado, Percent Female: 49.84767060299562
State: Connecticut, Percent Female: 51.66816507130644
State: Delaware, Percent Female: 51.61109733558627
State: District of Columbia, Percent Female: 53.129626141738385
State: Florida, Percent Female: 51.36488001165242
State: Georgia, Percent Female: 51.11408350339436
State: Hawaii, Percent Female: 51.118011836915514
State: Idaho, Percent Female: 49.98972623903102
State: Illinois, Percent Female: 51.11224234802867
State: Indiana, Percent Female: 50.95480313297678
State: Iowa, Percent Female: 50.950398342534264
State: Kansas, Percent Female: 50.821864107754735
State: Kentucky, Percent Female: 51.32687036927168
State: Louisiana, Percent Female: 51.75351596554121
State: Maine, Percent Female: 51.50570813418951
State: Maryland, Percent Female: 51.93575549972231
State: Massachusetts, Percent Female: 51.843023571316785
State: Michigan, Percent Female: 50.97246518318712
State: Minnesota, Percent Female: 50.49332944301148
State: Mississippi, Percent Female: 51.92229481794672
State: Missouri, Percent Female: 51.46888602639692
State: Montana, Percent Female: 50.32202690728538
State: Nebraska, Percent Female: 50.8584549336086
State: Nevada, Percent Female: 49.36736361384359
State: New Hampshire, Percent Female: 50.858019844961746
State: New Jersey, Percent Female: 51.51713956125773
State: New Mexico, Percent Female: 51.0471720798335
State: New York, Percent Female: 51.83453865150073
State: North Carolina, Percent Female: 51.482262322084594
State: North Dakota, Percent Female: 50.50069363231332
State: Ohio, Percent Female: 51.46550350015544
State: Oklahoma, Percent Female: 51.11362457075227
State: Oregon, Percent Female: 50.4294670361772
State: Pennsylvania, Percent Female: 51.74043473051053
State: Rhode Island, Percent Female: 52.07343391902215
State: South Carolina, Percent Female: 51.73072129765755
State: South Dakota, Percent Female: 50.52583581373275
State: Tennessee, Percent Female: 51.430689699449275
State: Texas, Percent Female: 50.515721664207966
State: Utah, Percent Female: 49.97295275106927
State: Vermont, Percent Female: 51.018573209949466
State: Virginia, Percent Female: 51.657252447241795
State: Washington, Percent Female: 50.518565087218334
State: West Virginia, Percent Female: 51.40042318092286
State: Wisconsin, Percent Female: 50.61486452653393
State: Wyoming, Percent Female: 49.94595542648306
|
```python
results_df = pd.DataFrame(results, columns=['State', 'Percent_Female'])
results_df
```
|    | State | Percent_Female |
|----|-------|----------------|
| 0 | Alabama | 51.832408 |
| 1 | Alaska | 49.301498 |
| 2 | Arizona | 50.223613 |
| 3 | Arkansas | 51.269928 |
| 4 | California | 50.352332 |
| 5 | Colorado | 49.847671 |
| 6 | Connecticut | 51.668165 |
| 7 | Delaware | 51.611097 |
| 8 | District of Columbia | 53.129626 |
| 9 | Florida | 51.364880 |
| 10 | Georgia | 51.114084 |
| 11 | Hawaii | 51.118012 |
| 12 | Idaho | 49.989726 |
| 13 | Illinois | 51.112242 |
| 14 | Indiana | 50.954803 |
| 15 | Iowa | 50.950398 |
| 16 | Kansas | 50.821864 |
| 17 | Kentucky | 51.326870 |
| 18 | Louisiana | 51.753516 |
| 19 | Maine | 51.505708 |
| 20 | Maryland | 51.935755 |
| 21 | Massachusetts | 51.843024 |
| 22 | Michigan | 50.972465 |
| 23 | Minnesota | 50.493329 |
| 24 | Mississippi | 51.922295 |
| 25 | Missouri | 51.468886 |
| 26 | Montana | 50.322027 |
| 27 | Nebraska | 50.858455 |
| 28 | Nevada | 49.367364 |
| 29 | New Hampshire | 50.858020 |
| 30 | New Jersey | 51.517140 |
| 31 | New Mexico | 51.047172 |
| 32 | New York | 51.834539 |
| 33 | North Carolina | 51.482262 |
| 34 | North Dakota | 50.500694 |
| 35 | Ohio | 51.465504 |
| 36 | Oklahoma | 51.113625 |
| 37 | Oregon | 50.429467 |
| 38 | Pennsylvania | 51.740435 |
| 39 | Rhode Island | 52.073434 |
| 40 | South Carolina | 51.730721 |
| 41 | South Dakota | 50.525836 |
| 42 | Tennessee | 51.430690 |
| 43 | Texas | 50.515722 |
| 44 | Utah | 49.972953 |
| 45 | Vermont | 51.018573 |
| 46 | Virginia | 51.657252 |
| 47 | Washington | 50.518565 |
| 48 | West Virginia | 51.400423 |
| 49 | Wisconsin | 50.614865 |
| 50 | Wyoming | 49.945955 |
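Note that `census` and `connection` are defined earlier in the notebook. To experiment with the same `case()`/`cast()` pattern in isolation, a self-contained sketch against a throwaway in-memory SQLite table might look like this (the table name `demo` and the sample rows are invented for illustration):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, String,
                        Integer, select, func, case, cast, Float, insert)

# Throwaway in-memory database with a tiny census-like table
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
demo = Table('demo', metadata,
             Column('state', String),
             Column('sex', String),
             Column('pop2000', Integer))
metadata.create_all(engine)

with engine.connect() as conn:
    conn.execute(insert(demo), [
        {'state': 'X', 'sex': 'F', 'pop2000': 510},
        {'state': 'X', 'sex': 'M', 'pop2000': 490},
    ])
    # case() sums only the female rows; cast() forces float division
    stmt = select(
        demo.c.state,
        (func.sum(case((demo.c.sex == 'F', demo.c.pop2000), else_=0)) /
         cast(func.sum(demo.c.pop2000), Float) * 100).label('percent_female')
    ).group_by(demo.c.state)
    row = conn.execute(stmt).fetchone()
    print(f'{row.state}: {row.percent_female:.1f}')  # X: 51.0
```

Without the `cast(..., Float)`, SQLite would perform integer division and the percentage would silently come back as 0.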
Build a Query to Determine the Difference by State from the 2000 and 2008 Censuses
In this final exercise, you will write a query to calculate the states that changed the most in population. You will limit your query to display only the top 10 states.
Instructions
- Build a statement to:
- Select state.
- Calculate the difference in population between 2008 (pop2008) and 2000 (pop2000).
- Group the query by census.columns.state using the .group_by() method on stmt.
- Order by ‘pop_change’ in descending order using the .order_by() method with the desc() function on ‘pop_change’.
- Limit the query to the top 10 states using the .limit() method.
- Execute the query and store it as results.
- Print the state and the population change for each result. This has been done for you, so hit ‘Submit Answer’ to see the result!
```python
# Build query to return state name and population difference from 2008 to 2000
stmt = select(census.columns.state,
              (census.columns.pop2008 - census.columns.pop2000).label('pop_change'))

# Group by state
stmt = stmt.group_by(census.columns.state)

# Order by population change, descending
stmt = stmt.order_by(desc('pop_change'))

# Limit to the top 10 states
stmt = stmt.limit(10)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()
```
```python
# Print the state and population change for each record
for result in results:
    print(f'{result.state}:{result.pop_change}')
```
```text
Texas:40137
California:35406
Florida:21954
Arizona:14377
Georgia:13357
North Carolina:11574
Virginia:6639
Colorado:6425
Utah:5934
Illinois:5412
```
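The `label()`/`order_by(desc())`/`limit()` chain used above can also be tried standalone. Here is a minimal sketch on an in-memory SQLite table, with the table name `pops` and its three rows made up for illustration:

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, String,
                        Integer, select, desc, insert)

# Throwaway in-memory table with two population snapshots per state
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
pops = Table('pops', metadata,
             Column('state', String),
             Column('pop2000', Integer),
             Column('pop2008', Integer))
metadata.create_all(engine)

with engine.connect() as conn:
    conn.execute(insert(pops), [
        {'state': 'A', 'pop2000': 100, 'pop2008': 150},
        {'state': 'B', 'pop2000': 100, 'pop2008': 130},
        {'state': 'C', 'pop2000': 100, 'pop2008': 90},
    ])
    # Compute the change, sort it descending, and keep the top 2 rows
    stmt = (select(pops.c.state,
                   (pops.c.pop2008 - pops.c.pop2000).label('pop_change'))
            .order_by(desc('pop_change'))
            .limit(2))
    results = conn.execute(stmt).fetchall()

for state, change in results:
    print(f'{state}: {change}')
# A: 50
# B: 30
```

Passing the string `'pop_change'` to `desc()` works because the label was attached to the column expression in the `select()`.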
```python
pop_change_df = pd.DataFrame(results, columns=['State', 'Population_Change'])
pop_change_df
```
|   | State | Population_Change |
|---|-------|-------------------|
| 0 | Texas | 40137 |
| 1 | California | 35406 |
| 2 | Florida | 21954 |
| 3 | Arizona | 14377 |
| 4 | Georgia | 13357 |
| 5 | North Carolina | 11574 |
| 6 | Virginia | 6639 |
| 7 | Colorado | 6425 |
| 8 | Utah | 5934 |
| 9 | Illinois | 5412 |
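Since matplotlib is imported at the top of the notebook, the top-10 changes could also be visualized. A minimal sketch, with the values hard-coded from the output above so the snippet runs standalone (the output filename `pop_change.png` is arbitrary):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Values copied from the query output above
pop_change_df = pd.DataFrame({
    'State': ['Texas', 'California', 'Florida', 'Arizona', 'Georgia',
              'North Carolina', 'Virginia', 'Colorado', 'Utah', 'Illinois'],
    'Population_Change': [40137, 35406, 21954, 14377, 13357,
                          11574, 6639, 6425, 5934, 5412],
})

# Horizontal bars, largest change on top
ax = pop_change_df.plot.barh(x='State', y='Population_Change', legend=False)
ax.invert_yaxis()
ax.set_xlabel('Population change, 2000 to 2008')
plt.tight_layout()
plt.savefig('pop_change.png')
```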
Certificate