import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine, Table, MetaData, select, and_, desc, func, case, cast, Float
from sqlalchemy import Column, String, Integer, Boolean, insert, update, delete
import pymysql
from pprint import pprint as pp
import csv
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
census_csv_data = 'data/intro_to_databases_in_python/census.csv'
census_sql_data = 'sqlite:///data/intro_to_databases_in_python/census.sqlite'
employees_sql_data = 'sqlite:///data/intro_to_databases_in_python/employees.sqlite'
Course Description
In this Python SQL course, you'll learn the basics of using Structured Query Language (SQL) with Python. This will be useful since, whether you like it or not, databases are ubiquitous and, as a data scientist, you'll need to interact with them constantly. The Python SQL toolkit SQLAlchemy provides an accessible and intuitive way to query, build, and write to SQLite, MySQL, and PostgreSQL databases (among many others), all of which you will encounter in the daily life of a data scientist.
In this chapter, you will become acquainted with the fundamentals of Relational Databases and the Relational Model. You will learn how to connect to a database and then interact with it by writing basic SQL queries, both in raw SQL as well as with SQLAlchemy, which provides a Pythonic way of interacting with databases.
Which of the following is not part of the relational model?
Answer the question
In [1]: from sqlalchemy import create_engine
In [2]: engine = create_engine('sqlite:///census_nyc.sqlite')
In [3]: connection = engine.connect()
In [1]: from sqlalchemy import create_engine
In [2]: engine = create_engine('sqlite:///census_nyc.sqlite')
In [3]: print(engine.table_names())
Out[3]: ['census', 'state_fact']
In [1]: from sqlalchemy import MetaData, Table
In [2]: metadata = MetaData()
In [3]: census = Table('census', metadata, autoload=True, autoload_with=engine)
In [4]: print(repr(census))
Out[4]:
Table('census', MetaData(bind=None), Column('state',
VARCHAR(length=30), table=<census>), Column('sex',
VARCHAR(length=1), table=<census>), Column('age', INTEGER(),
table=<census>), Column('pop2000', INTEGER(), table=<census>),
Column('pop2008', INTEGER(), table=<census>), schema=None)
Alright, it's time to create your first engine! An engine is just a common interface to a database, and the information it requires to connect to one is contained in a connection string, such as sqlite:///census_nyc.sqlite. Here, sqlite is the database dialect, while census_nyc.sqlite is a SQLite file contained in the local directory.
You can learn a lot more about connection strings in the SQLAlchemy documentation.
Your job in this exercise is to create an engine that connects to a local SQLite file named census.sqlite. Then, print the names of the tables it contains using the .table_names() method. Note that when you just want to print the table names, you do not need to use engine.connect() after creating the engine.
Instructions
# Import create_engine - at top of notebook
# Create an engine that connects to the census.sqlite file: engine
engine = create_engine(census_sql_data)
# Print table names
engine.table_names()
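Note that .table_names() comes from older SQLAlchemy releases; it was deprecated in 1.4 and removed in 2.0. A minimal sketch of the modern equivalent, assuming SQLAlchemy 1.4 or later, uses the inspection API:
# Hedged sketch (assumes SQLAlchemy 1.4+): list table names via the inspector
from sqlalchemy import inspect
inspector = inspect(engine)
print(inspector.get_table_names())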
SQLAlchemy can be used to automatically load tables from a database using something called reflection. Reflection is the process of reading the database and building the metadata based on that information. It's the opposite of creating a Table by hand and is very useful for working with existing databases. To perform reflection, you need to import the Table object from the SQLAlchemy package. Then, you use this Table object to read your table from the engine and autoload the columns. Using the Table object in this manner is a lot like passing arguments to a function. For example, to autoload the columns with the engine, you have to specify the keyword arguments autoload=True and autoload_with=engine to Table().
In this exercise, your job is to reflect the census table available on your engine into a variable called census. The metadata has already been loaded for you using MetaData() and is available in the variable metadata.
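As an aside, newer SQLAlchemy releases (1.4 and later) no longer need autoload=True; passing autoload_with=engine by itself triggers reflection. A minimal sketch under that assumption:
# Hedged sketch (assumes SQLAlchemy 1.4+): autoload_with alone reflects the table
from sqlalchemy import MetaData, Table
metadata = MetaData()
census = Table('census', metadata, autoload_with=engine)
print(repr(census))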
Instructions
# Import Table - at top of Notebook
# Reflect census table from the engine: census
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Print census table metadata
repr(census)
Great job reflecting the census table! Now you can begin to learn more about the columns and structure of your table. It is important to get an understanding of your database by examining the column names. This can be done by using the .columns attribute and accessing the .keys() method. For example, census.columns.keys() would return a list of column names of the census table.
Following this, we can use the metadata container to find out more details about the reflected table such as the columns and their types. For example, table objects are stored in the metadata.tables dictionary, so you can get the metadata of your census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table from the previous exercise.
Instructions
# Reflect the census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Print the column names
census.columns.keys()
# Print full table metadata
repr(metadata.tables['census'])
● SELECT column_name FROM table_name
● SELECT pop2008 FROM People
● SELECT * FROM People
In [1]: from sqlalchemy import create_engine
In [2]: engine = create_engine('sqlite:///census_nyc.sqlite')
In [3]: connection = engine.connect()
In [4]: stmt = 'SELECT * FROM people'
In [5]: result_proxy = connection.execute(stmt)
In [6]: results = result_proxy.fetchall()
In [5]: result_proxy = connection.execute(stmt)
In [6]: results = result_proxy.fetchall()
In [1]: first_row = results[0]
In [2]: print(first_row)
Out[2]: ('Illinois', 'M', 0, 89600, 95012)
In [4]: print(first_row.keys())
Out[4]: ['state', 'sex', 'age', 'pop2000', 'pop2008']
In [6]: print(first_row.state)
Out[6]: 'Illinois'
In [4]: from sqlalchemy import Table, MetaData
In [5]: metadata = MetaData()
In [6]: census = Table('census', metadata, autoload=True, autoload_with=engine)
In [7]: stmt = select([census])
In [8]: results = connection.execute(stmt).fetchall()
In [9]: stmt = select([census])
In [10]: print(stmt)
Out[10]: 'SELECT * from CENSUS'
Using what we just learned about SQL and applying the .execute() method on our connection, we can leverage a raw SQL query to query all the records in our census table. The object returned by the .execute() method is a ResultProxy. On this ResultProxy, we can then use the .fetchall() method to get our results - that is, the ResultSet.
In this exercise, you'll use a traditional SQL query. In the next exercise, you'll move to SQLAlchemy and begin to understand its advantages. Go for it!
Instructions
engine = create_engine(census_sql_data)
connection = engine.connect()
# Build select statement for census table: stmt
stmt = 'SELECT * FROM census'
# Execute the statement and fetch the results: results
results = connection.execute(stmt).fetchall()
results[:5]
It's now time to build your first select statement using SQLAlchemy. SQLAlchemy provides a nice "Pythonic" way of interacting with databases. So rather than dealing with the differences between specific dialects of traditional SQL such as MySQL or PostgreSQL, you can leverage the Pythonic framework of SQLAlchemy to streamline your workflow and more efficiently query your data. For this reason, it is worth learning even if you may already be familiar with traditional SQL.
In this exercise, you'll once again build a statement to query all records from the census table. This time, however, you'll make use of the select() function of the sqlalchemy module. This function requires a list of tables or columns as the only required argument.
Table and MetaData have already been imported. The metadata is available as metadata and the connection to the database as connection.
Instructions
# Import select - at top of Notebook
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
# Reflect census table via engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Build select statement for census table: stmt
stmt = select([census])
# Print the emitted statement to see the SQL emitted
print(stmt)
# Execute the statement and print the results
results = connection.execute(stmt).fetchall()
results[:5]
Recall the differences between a ResultProxy and a ResultSet: the ResultProxy is the object returned when you call the .execute() method on a connection, while the ResultSet is the actual data you get back when you call a fetch method such as .fetchall() on that ResultProxy.
This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.
Once we have a ResultSet, we can use Python to access all the data within it by column name and by list-style indexes. For example, you can get the first row of the results by using results[0]. With that first row then assigned to a variable first_row, you can get data from the first column by either using first_row[0] or by column name such as first_row['column_name']. You'll now practice exactly this using the ResultSet you obtained from the census table in the previous exercise. It is stored in the variable results. Enjoy!
Instructions
# Get the first row of the results by using an index: first_row
first_row = results[0]
# Print the first row of the results
print(first_row)
# Print the first column of the first row by using an index
print(first_row[0])
# Print the 'state' column of the first row by using its name
print(first_row['state'])
In this chapter, you will build on the database knowledge you began acquiring in the previous chapter by writing more nuanced queries that allow you to filter, order, and count your data, all within the Pythonic framework provided by SQLAlchemy!
In [1]: stmt = select([census])
In [2]: stmt = stmt.where(census.columns.state ==
'California')
In [3]: results = connection.execute(stmt).fetchall()
In [4]: for result in results:
...: print(result.state, result.age)
Out[4]:
California 0
California 1
California 2
California 3
California 4
...
In [1]: stmt = select([census])
In [2]: stmt = stmt.where(census.columns.state.startswith('New'))
In [3]: for result in connection.execute(stmt):
            print(result.state, result.pop2000)
Out[3]:
New Jersey 56983
New Jersey 56686
New Jersey 57011
...
In [1]: from sqlalchemy import or_
In [2]: stmt = select([census])
In [3]: stmt = stmt.where(or_(census.columns.state == 'California',
census.columns.state == 'New York'))
In [4]: for result in connection.execute(stmt):
            print(result.state, result.sex)
Out[4]:
New York M
…
California F
In these exercises, you will be working with real databases hosted on the cloud via Amazon Web Services (AWS)!
Let's begin by connecting to a PostgreSQL database. When connecting to a PostgreSQL database, many prefer to use the psycopg2 database driver as it supports practically all of PostgreSQL's features efficiently and is the standard dialect for PostgreSQL in SQLAlchemy.
You might recall from Chapter 1 that we use the create_engine() function and a connection string to connect to a database.
There are three components to the connection string in this exercise: the dialect and driver ('postgresql+psycopg2://'), followed by the username and password ('student:datacamp'), followed by the host and port ('@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/'), and finally, the database name ('census'). You will have to pass this string as an argument to create_engine() in order to connect to the database.
Instructions
# Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Create an engine to the census database - exercise
# engine = create_engine('postgresql+psycopg2://student:datacamp@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/census')
# Use the .table_names() method on the engine to print the table names
print(engine.table_names())
Having connected to the database, it's now time to practice filtering your queries!
As mentioned in the video, a where() clause is used to filter the data that a statement returns. For example, to select all the records from the census table where the sex is Female (or 'F') we would do the following:
select([census]).where(census.columns.sex == 'F')
In addition to ==, we can use basically any Python comparison operator (such as <=, !=, etc.) in the where() clause.
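As a quick illustration (not part of the exercise), a sketch of a filter that uses a different comparison operator on the reflected census table might look like this:
# Hypothetical sketch: keep only records where age is 50 or older
stmt = select([census])
stmt = stmt.where(census.columns.age >= 50)
print(stmt)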
Instructions
# Create a select query: stmt
stmt = select([census])
# Add a where clause to filter the results to only those for New York
stmt = stmt.where(census.columns.state == 'New York')
# Execute the query to retrieve all the data returned: results
results = connection.execute(stmt).fetchall()
# Loop over the results and print the age, sex, and pop2008
for i, result in enumerate(results):
    if i < 7:
        print(result.age, result.sex, result.pop2008)
results[:7]
In addition to standard Python comparators, we can also use methods such as in_() to create more powerful where() clauses. You can see a full list of expressions in the SQLAlchemy Documentation.
We've already created a list of some of the most densely populated states.
Instructions
states = ['New York', 'California', 'Texas']
# Create a query for the census table: stmt
stmt = select([census])
# Append a where clause to match all the states in_ the list states
stmt = stmt.where(census.columns.state.in_(states))
# Loop over the ResultProxy and print the state and its population in 2000
for i, result in enumerate(connection.execute(stmt)):
    if i < 7:
        print(result.state, result.pop2000)
You're really getting the hang of this! SQLAlchemy also allows users to use conjunctions such as and_(), or_(), and not_() to build more complex filtering. For example, we can get a set of records for people in New York who are 21 or 37 years old with the following code:
select([census]).where(and_(census.columns.state == 'New York',
or_(census.columns.age == 21,
census.columns.age == 37)))
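The not_() conjunction works the same way; a small sketch (assuming not_ has been imported) that excludes California records could look like this:
# Sketch: exclude California using not_()
from sqlalchemy import not_
stmt = select([census]).where(not_(census.columns.state == 'California'))
print(stmt)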
Instructions
# Build a query for the census table: stmt
stmt = select([census])
# Append a where clause to select only non-male records from California using and_
# The state of California with a non-male sex
stmt = stmt.where(and_(census.columns.state == 'California',
census.columns.sex != 'M'))
# Loop over the ResultProxy printing the age and sex
for i, result in enumerate(connection.execute(stmt)):
    if i < 7:
        print(result.age, result.sex)
In [1]: print(results[:10])
Out[1]: [('Illinois',), …]
In [3]: stmt = select([census.columns.state])
In [4]: stmt = stmt.order_by(census.columns.state)
In [5]: results = connection.execute(stmt).fetchall()
In [6]: print(results[:10])
Out[6]: [('Alabama',), …]
In [6]: print(results)
Out[6]: ('Alabama', 'M')
In [7]: stmt = select([census.columns.state, census.columns.sex])
In [8]: stmt = stmt.order_by(census.columns.state, census.columns.sex)
In [9]: results = connection.execute(stmt).first()
In [10]: print(results)
Out[10]:('Alabama', 'F')
('Alabama', 'F')
…
('Alabama', 'M')
# Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
To sort the result output by a field, we use the .order_by() method. By default, the .order_by() method sorts from lowest to highest on the supplied column. You just have to pass in the name of the column you want sorted to .order_by().
In the video, for example, Jason used stmt.order_by(census.columns.state) to sort the result output by the state column.
Instructions
# Build a query to select the state column: stmt
stmt = select([census.columns.state])
# Order stmt by the state column
stmt = stmt.order_by(census.columns.state)
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
# Print the first 10 results
results[:10]
You can also use .order_by() to sort from highest to lowest by wrapping a column in the desc() function. Although you haven't seen this function in action, it generalizes what you have already learned.
Pass desc() (for "descending") inside an .order_by() with the name of the column you want to sort by. For instance, stmt.order_by(desc(table.columns.column_name)) sorts column_name in descending order.
Instructions
# Build a query to select the state column: stmt
stmt = select([census.columns.state])
# Order stmt by state in descending order: rev_stmt
rev_stmt = stmt.order_by(desc(census.columns.state))
# Execute the query and store the results: rev_results
rev_results = connection.execute(rev_stmt).fetchall()
# Print the first 10 rev_results
rev_results[:10]
We can pass multiple arguments to the .order_by() method to order by multiple columns. In fact, we can also sort in ascending or descending order for each individual column. Each column in the .order_by() method is fully sorted from left to right. This means that the first column is completely sorted, and then within each matching group of values in the first column, it's sorted by the next column in the .order_by() method. This process is repeated until all the columns in the .order_by() are sorted.
Instructions
# Build a query to select state and age: stmt
stmt = select([census.columns.state, census.columns.age])
# Append order by to ascend by state and descend by age
stmt = stmt.order_by(census.columns.state, desc(census.columns.age))
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print the first 20 results
results[:20]
In [1]: from sqlalchemy import func
In [2]: stmt = select([func.sum(census.columns.pop2008)])
In [3]: results = connection.execute(stmt).scalar()
In [4]: print(results)
Out[4]: 302876613
In [1]: stmt = select([census.columns.sex, func.sum(census.columns.pop2008)])
In [2]: stmt = stmt.group_by(census.columns.sex)
In [3]: results = connection.execute(stmt).fetchall()
In [4]: print(results)
Out[4]: [('F', 153959198), ('M', 148917415)]
In [1]: stmt = select([census.columns.sex, census.columns.age,
func.sum(census.columns.pop2008)])
In [2]: stmt = stmt.group_by(census.columns.sex, census.columns.age)
In [2]: results = connection.execute(stmt).fetchall()
In [3]: print(results)
Out[3]:
[('F', 0, 2105442), ('F', 1, 2087705), ('F', 2, 2037280), ('F', 3,
2012742), ('F', 4, 2014825), ('F', 5, 1991082), ('F', 6, 1977923),
('F', 7, 2005470), ('F', 8, 1925725), …
In [1]: print(results[0].keys())
Out[1]: ['sex', u'sum_1']
In [2]: stmt = select([census.columns.sex,
func.sum(census.columns.pop2008).label( 'pop2008_sum')])
In [3]: stmt = stmt.group_by(census.columns.sex)
In [4]: results = connection.execute(stmt).fetchall()
In [5]: print(results[0].keys())
Out[5]: ['sex', 'pop2008_sum']
# Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
As mentioned in the video, SQLAlchemy's func module provides access to built-in SQL functions that can make operations like counting and summing faster and more efficient.
In the video, Jason used func.sum() to get a sum of the pop2008 column of census as shown below:
select([func.sum(census.columns.pop2008)])
If instead you want to count the number of values in pop2008, you could use func.count() like this:
select([func.count(census.columns.pop2008)])
Furthermore, if you only want to count the distinct values of pop2008, you can use the .distinct() method:
select([func.count(census.columns.pop2008.distinct())])
In this exercise, you will practice using func.count() and .distinct() to get a count of the distinct number of states in census.
So far, you've seen .fetchall() and .first() used on a ResultProxy to get the results. The ResultProxy also has a method called .scalar() for getting just the value of a query that returns only one row and column.
This can be very useful when you are querying for just a count or sum.
Instructions
# Build a query to count the distinct states values: stmt
stmt = select([func.count(census.columns.state.distinct())])
# Execute the query and store the scalar result: distinct_state_count
distinct_state_count = connection.execute(stmt).scalar()
# Print the distinct_state_count
distinct_state_count
Often, we want to get a count for each record with a particular value in another column. The .group_by() method helps answer this type of query. You can pass a column to the .group_by() method and use it together with an aggregate function like sum() or count(). Much like the .order_by() method, .group_by() can take multiple columns as arguments.
Instructions
# Build a query to select the state and count of ages by state: stmt
stmt = select([census.columns.state, func.count(census.columns.age)])
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print results
results[:5]
# Print the keys/column names of the results returned
results[0].keys()
To avoid confusion with query result column names like count_1, we can use the .label() method to provide a name for the resulting column. This gets appended to the function method we are using, and its argument is the name we want to use.
We can pair func.sum() with .group_by() to get a sum of the population by State and use the label() method to name the output.
We can also create the func.sum() expression before using it in the select statement. We do it the same way we would inside the select statement and store it in a variable. Then we use that variable in the select statement where the func.sum() would normally be.
Instructions
# Build an expression to calculate the sum of pop2008 labeled as population
pop2008_sum = func.sum(census.columns.pop2008).label('population')
# Build a query to select the state and sum of pop2008: stmt
stmt = select([census.columns.state, pop2008_sum])
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print results
results[:5]
# Print the keys/column names of the results returned
results[0].keys()
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(results)
In [3]: df.columns = results[0].keys()
In [4]: print(df)
Out[4]:
sex pop2008_sum
0 F 2105442
1 F 2087705
2 F 2037280
3 F 2012742
4 F 2014825
5 F 1991082
In [1]: import matplotlib.pyplot as plt
In [2]: df[10:20].plot.barh()
In [3]: plt.show()
# Use with local file
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Build an expression to calculate the sum of pop2008 labeled as population
pop2008_sum = func.sum(census.columns.pop2008).label('population')
# Build a query to select the state and sum of pop2008: stmt
stmt = select([census.columns.state, pop2008_sum])
stmt = stmt.order_by(desc(pop2008_sum))
# Group stmt by state
stmt = stmt.group_by(census.columns.state)
# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()
# Print results
results[:5]
We can feed a ResultProxy directly into a pandas DataFrame, which is the workhorse of many Data Scientists in PythonLand. Jason demonstrated this in the video. In this exercise, you'll follow exactly the same approach to convert a ResultProxy into a DataFrame.
Instructions
# import pandas
# Create a DataFrame from the results: df
df = pd.DataFrame(results)
# Set column names
df.columns = results[0].keys()
# Print the Dataframe
df.head()
We can also take advantage of pandas and Matplotlib to build figures of our data. Remember that data visualization is essential for both exploratory data analysis and communication of your data!
Instructions
# Import pyplot as plt from matplotlib
# Create a DataFrame from the results: df
df = pd.DataFrame(results)
# Set Column names
df.columns = results[0].keys()
# Print the DataFrame
df.head()
# Plot the DataFrame
df.iloc[:5].plot.bar()
Herein, you will learn to perform advanced - and incredibly useful - queries that will enable you to interact with your data in powerful ways.
In [1]: stmt = select([census.columns.age,
(census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
In [2]: stmt = stmt.group_by(census.columns.age)
In [3]: stmt = stmt.order_by(desc('pop_change'))
In [4]: stmt = stmt.limit(5)
In [5]: results = connection.execute(stmt).fetchall()
In [6]: print(results)
Out[6]: [(61, 52672), (85, 51901), (54, 50808), (58, 45575), (60,
44915)]
In [1]: from sqlalchemy import case
In [2]: stmt = select([func.sum(case([(census.columns.state == 'New York',
census.columns.pop2008)],
else_=0))])
In [3]: results = connection.execute(stmt).fetchall()
In [4]: print(results)
Out[4]:[(19465159,)]
In [1]: from sqlalchemy import case, cast, Float
In [2]: stmt = select([(func.sum(case([(census.columns.state == 'New York',
census.columns.pop2008)],
else_=0)) /
cast(func.sum(census.columns.pop2008), Float) * 100).label('ny_percent')])
In [3]: results = connection.execute(stmt).fetchall()
In [4]: print(results)
Out[4]: [(Decimal('6.4267619765'),)]
Before you jump into the calculation exercises, let's begin by connecting to our database. Recall that in the last chapter you connected to a PostgreSQL database. Now, you'll connect to a MySQL database, for which many prefer to use the pymysql database driver, which, like psycopg2 for PostgreSQL, you have to install prior to use.
This connection string is going to start with 'mysql+pymysql://', indicating which dialect and driver you're using to establish the connection. The dialect block is followed by the 'username:password' combo. Next, you specify the host and port with the following '@host:port/'. Finally, you wrap up the connection string with the 'database_name'.
Now you'll practice connecting to a MySQL database: it will be the same census database that you have already been working with. One of the great things about SQLAlchemy is that, after connecting, it abstracts over the type of database it has connected to and you can write the same SQLAlchemy code, regardless!
Instructions
# Use with local file
engine2 = create_engine(census_sql_data)
print('Engine Table Names: \n', engine2.table_names())
connection = engine2.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine2)
print('\nCensus:')
census
# Use for remote connection
# Create an engine to the census database
engine = create_engine('mysql+pymysql://student:datacamp@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/census')
# Print the table names
print('Engine Table Names: \n', engine.table_names())
# Copy and run as code to use with remote engine
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
print('\nCensus:')
census
Often, you'll need to perform math operations as part of a query, such as if you wanted to calculate the change in population from 2000 to 2008. For math operations on numbers, the operators in SQLAlchemy work the same way as they do in Python.
You can use these operators to perform addition (+), subtraction (-), multiplication (*), division (/), and modulus (%) operations. Note: They behave differently when used with non-numeric column types.
Let's now find the top 5 states by population growth between 2000 and 2008.
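As an illustrative sketch (not the exercise itself), the same operators can be combined freely; here the two population columns are averaged per row. Note that on integer columns this division may truncate, which is where the cast() function covered later comes in handy:
# Sketch: arithmetic on columns works like ordinary Python arithmetic
stmt = select([census.columns.state,
               ((census.columns.pop2000 + census.columns.pop2008) / 2).label('avg_pop')])
print(connection.execute(stmt).first())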
Instructions
# Build query to return state names by population difference from 2008 to 2000: stmt
stmt = select([census.columns.state,
(census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
# Append group by for the state: stmt
stmt = stmt.group_by(census.columns.state)
# Append order by for pop_change descendingly: stmt
stmt = stmt.order_by(desc('pop_change'))
# Return only 5 results: stmt
stmt = stmt.limit(5)
# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()
# Print the state and population change for each record
for result in results:
    print(f'{result.state}:{result.pop_change}')
It's possible to combine functions and operators in a single select statement as well. These combinations can be exceptionally handy when we want to calculate percentages or averages, and we can also use the case() expression to operate on data that meets specific criteria while not affecting the query as a whole. The case() expression accepts a list of conditions to match and the column to return if the condition matches, followed by an else_ if none of the conditions match. We can wrap this entire expression in any function or math operation we like.
Often when performing integer division, we want to get a float back. While some databases will do this automatically, you can use the cast() function to convert an expression to a particular type.
Instructions
# import case, cast and Float from sqlalchemy - at top of notebook
# Build an expression to calculate female population in 2000
female_pop2000 = func.sum(
case([(census.columns.sex == 'F', census.columns.pop2000)], else_=0))
# Cast an expression to calculate total population in 2000 to Float
total_pop2000 = cast(func.sum(census.columns.pop2000), Float)
# Build a query to calculate the percentage of females in 2000: stmt
stmt = select([female_pop2000 / total_pop2000 * 100])
# Execute the query and store the scalar result: percent_female
percent_female = connection.execute(stmt).scalar()
percent_female
In [1]: stmt = select([census.columns.pop2008, state_fact.columns.abbreviation])
In [2]: results = connection.execute(stmt).fetchall()
In [3]: print(results)
Out[3]: [(95012, u'IL'),
(95012, u'NJ'),
(95012, u'ND'),
(95012, u'OR'),
(95012, u'DC'),
(95012, u'WI'),
…
In [1]: stmt = select([func.sum(census.columns.pop2000)])
In [2]: stmt = stmt.select_from(census.join(state_fact))
In [3]: stmt = stmt.where(state_fact.columns.circuit_court == '10')
In [4]: result = connection.execute(stmt).scalar()
In [5]: print(result)
Out[5]: 14945252
In [1]: stmt = select([func.sum(census.columns.pop2000)])
In [2]: stmt = stmt.select_from(census.join(state_fact,
census.columns.state == state_fact.columns.name))
In [3]: stmt = stmt.where(state_fact.columns.census_division_name == 'East South Central')
In [4]: result = connection.execute(stmt).scalar()
In [5]: print(result)
Out[5]: 16982311
If you have two tables that already have an established relationship, you can automatically use that relationship by just adding the columns we want from each table to the select statement. Recall that Jason constructed the following query:
stmt = select([census.columns.pop2008, state_fact.columns.abbreviation])
in order to join the census and state_fact tables and select the pop2008 column from the first and the abbreviation column from the second. In this case, the census and state_fact tables had a pre-defined relationship: the state column of the former corresponded to the name column of the latter.
In this exercise, you'll use the same predefined relationship to select the pop2000 and abbreviation columns!
Instructions
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine2)
state_fact
# Build a statement to join census and state_fact tables: stmt
stmt = select([census.columns.pop2000, state_fact.columns.abbreviation])
# Execute the statement and get the first result: result
result = connection.execute(stmt).first()
# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))
If you aren't selecting columns from both tables or the two tables don't have a defined relationship, you can still use the .join() method on a table to join it with another table and get extra data related to our query. The .join() method takes the table object you want to join in as the first argument, and a condition that indicates how the tables are related as the second argument. Finally, you use the .select_from() method on the select statement to wrap the join clause. For example, in the video, Jason executed the following code to join the census table to the state_fact table such that the state column of the census table corresponded to the name column of the state_fact table.
stmt = stmt.select_from(
    census.join(
        state_fact, census.columns.state ==
        state_fact.columns.name))
Instructions
# Build a statement to select the census and state_fact tables: stmt
stmt = select([census, state_fact])
# Add a select_from clause that wraps a join for the census and state_fact
# tables where the census state column and state_fact name column match
stmt = stmt.select_from(
census.join(state_fact, census.columns.state == state_fact.columns.name))
# Execute the statement and get the first result: result
result = connection.execute(stmt).first()
# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))
You can use the same select statement you built in the last exercise; however, let's add a twist and only return a few columns and use the other table in a group_by() clause.
Instructions
# Build a statement to select the state, sum of 2008 population and census
# division name: stmt
stmt = select([census.columns.state,
func.sum(census.columns.pop2008),
state_fact.columns.census_division_name])
# Append select_from to join the census and state_fact tables by the census state and state_fact name columns
stmt = stmt.select_from(
census.join(state_fact, census.columns.state == state_fact.columns.name)
)
# Append a group by for the state_fact name column
stmt = stmt.group_by(state_fact.columns.name)
# Execute the statement and get the results: results
results = connection.execute(stmt).fetchall()
# Loop over the results object and print each record
for record in results:
    print(record)
In [1]: managers = employees.alias()
In [2]: stmt = select([managers.columns.name.label('manager'),
employees.columns.name.label('employee')])
In [3]: stmt = stmt.select_from(employees.join(managers,
                managers.columns.id == employees.columns.manager))
In [4]: stmt = stmt.order_by(managers.columns.name)
In [5]: print(connection.execute(stmt).fetchall())
Out[5]: [(u'FILLMORE', u'GRANT'),
(u'FILLMORE', u'ADAMS'),
(u'HARDING', u'TAFT'), ...
In [1]: managers = employees.alias()
In [2]: stmt = select([managers.columns.name, func.sum(employees.columns.sal)])
In [3]: stmt = stmt.select_from(employees.join(managers,
                managers.columns.id == employees.columns.manager))
In [4]: stmt = stmt.group_by(managers.columns.name)
In [5]: print(connection.execute(stmt).fetchall())
Out[5]: [(u'FILLMORE', Decimal('96000.00')),
(u'GARFIELD', Decimal('83500.00')),
(u'HARDING', Decimal('52000.00')),
(u'JACKSON', Decimal('197000.00'))]
# Use with local file
engine = create_engine(employees_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
employees = Table('employees', metadata, autoload=True, autoload_with=engine)
Often, you'll have tables that contain hierarchical data, such as employees and managers who are also employees. For this reason, you may wish to join a table to itself on different columns. The .alias() method, which creates a copy of a table, helps accomplish this task. Because it's the same table, you only need a where clause to specify the join condition.
Here, you'll use the .alias() method to build a query to join the employees table against itself to determine to whom everyone reports.
Instructions
# Make an alias of the employees table: managers
managers = employees.alias()
# Build a query to select manager's and their employees names: stmt
stmt = select(
[managers.columns.name.label('manager'),
employees.columns.name.label('employee')]
)
# Match managers id with employees mgr: stmt
stmt = stmt.where(managers.columns.id == employees.columns.mgr)
# Order the statement by the managers name: stmt
stmt = stmt.order_by(managers.columns.name)
# Execute statement: results
results = connection.execute(stmt).fetchall()
# Print records
for record in results:
    print(record)
It's also common to want to roll up data which is in a hierarchical table. Rolling up data requires making sure you're careful which alias you use to perform the group_bys and which table you use for the function.
Here, your job is to get a count of employees for each manager.
Instructions
# Make an alias of the employees table: managers
managers = employees.alias()
# Build a query to select managers and counts of their employees: stmt
stmt = select([managers.columns.name, func.count(employees.columns.id)])
# Append a where clause that ensures the manager id and employee mgr are equal
stmt = stmt.where(managers.columns.id == employees.columns.mgr)
# Group by Managers Name
stmt = stmt.group_by(managers.columns.name)
# Execute statement: results
results = connection.execute(stmt).fetchall()
# Print the count of employees for each manager
for record in results:
    print(record)
In [1]: while more_results:
            partial_results = results_proxy.fetchmany(50)
            if partial_results == []:
                more_results = False
            for row in partial_results:
                state_count[row.state] += 1
In [2]: results_proxy.close()
# Import select - at top of Notebook
engine = create_engine(census_sql_data)
connection = engine.connect()
metadata = MetaData()
# Reflect census table via engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)
# Build select statement for census table: stmt
stmt = select([census])
# Print the emitted statement to see the SQL emitted
print(stmt)
# Execute the statement and print the results
results_proxy = connection.execute(stmt)
Fantastic work so far! As Jason discussed in the video, sometimes you may have the need to work on a large ResultProxy, and you may not have the memory to load all the results at once. To work around that issue, you can get blocks of rows from the ResultProxy by using the .fetchmany() method inside a loop. With .fetchmany(), give it an argument of the number of records you want. When you reach an empty list, there are no more rows left to fetch, and you have processed all the results of the query. Then you need to use the .close() method to close out the connection to the database.
You'll now have the chance to practice this on a large ResultProxy called results_proxy that has been pre-loaded for you to work with.
Instructions
# set more_results & state_count
more_results = True
state_count = dict()
# Start a while loop checking for more results
while more_results:
    # Fetch the first 50 results from the ResultProxy: partial_results
    partial_results = results_proxy.fetchmany(50)
    # if empty list, set more_results to False
    if partial_results == []:
        more_results = False
    # Loop over the fetched records and increment the count for the state
    for row in partial_results:
        if row.state in state_count:
            state_count[row.state] += 1
        else:
            state_count[row.state] = 1
# Close the ResultProxy, and thus the connection
results_proxy.close()
# Print the count by state
print(state_count)
In the previous chapters, you interacted with existing databases and queried them in various different ways. Now, you will learn how to build your own databases and keep them updated!
In [1]: from sqlalchemy import (Table, Column, String, Integer, Numeric, Boolean)
In [2]: employees = Table('employees', metadata,
Column('id', Integer()),
Column('name', String(255)),
Column('salary', Numeric()),
Column('active', Boolean()))
In [3]: metadata.create_all(engine)
In [4]: engine.table_names()
Out[4]: [u'employees']
In [1]: employees = Table('employees', metadata,
Column('id', Integer()),
Column('name', String(255), unique=True,
nullable=False),
Column('salary', Float(), default=100.00),
Column('active', Boolean(), default=True))
In [2]: employees.constraints
Out[2]: {CheckConstraint(Column('name', String(length=255), table=<employees>,
nullable=False),
Column('salary', Float(), table=<employees>,
default=ColumnDefault(100.0)),
Column('active', Boolean(), table=<employees>,
default=ColumnDefault(True)) ...
UniqueConstraint(Column('name', String(length=255),
table=<employees>, nullable=False))}
Previously, you used the Table object to reflect a table from an existing database, but what if you wanted to create a new table? You'd still use the Table object; however, you'd need to replace the autoload and autoload_with parameters with Column objects.
The Column object takes a name, a SQLAlchemy type with an optional format, and optional keyword arguments for different constraints.
When defining the table, recall how in the video Jason passed in 255 as the maximum length of a String by using Column('name', String(255)).
After defining the table, you can create the table in the database by using the .create_all() method on metadata and supplying the engine as the only parameter. Go for it!
Instructions
metadata = MetaData()
# Define a new table with name, count, amount, and valid columns: test_data
test_data = Table('test_data', metadata,
Column('name', String(255)),
Column('count', Integer()),
Column('amount', Float()),
Column('valid', Boolean()))
engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print table details
print(repr(test_data))
connection.close()
You're now going to practice creating a table with some constraints! Often, you'll need to make sure that a column is unique, nullable, a positive value, or related to a column in another table. This is where constraints come in.
As Jason showed you in the video, in addition to constraints, you can also set a default value for the column if no data is passed to it via the default keyword on the column.
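To see a default kick in, here is a small self-contained sketch (an in-memory SQLite database and a hypothetical demo table, not part of the exercise) where a row is inserted without the defaulted columns:
# Self-contained sketch: column defaults are applied when a value is omitted on insert
from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, Boolean, insert, select
demo_engine = create_engine('sqlite:///:memory:')
demo_metadata = MetaData()
demo = Table('demo', demo_metadata,
             Column('name', String(255), unique=True),
             Column('count', Integer(), default=1),
             Column('valid', Boolean(), default=False))
demo_metadata.create_all(demo_engine)
with demo_engine.connect() as demo_conn:
    demo_conn.execute(insert(demo).values(name='Anna'))  # count and valid omitted
    print(demo_conn.execute(select([demo])).first())     # expected: ('Anna', 1, False)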
Instructions
metadata = MetaData()
# Define a new table with name, count, amount, and valid columns: test_data
test_data = Table('test_data', metadata,
Column('name', String(255), unique=True),
Column('count', Integer(), default=1),
Column('amount', Float()),
Column('valid', Boolean(), default=False),
extend_existing=True)
engine = create_engine('sqlite:///:memory:')
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print the table details
print(repr(metadata.tables['test_data']))
connection.close()
In [1]: from sqlalchemy import insert
In [2]: stmt = insert(employees).values(id=1,
name='Jason', salary=1.00, active=True)
In [3]: result_proxy = connection.execute(stmt)
In [4]: print(result_proxy.rowcount)
Out[4]: 1
In [1]: stmt = insert(employees)
In [2]: values_list = [{'id': 2, 'name': 'Rebecca', 'salary': 2.00,
'active': True},
{'id': 3, 'name': 'Bob', 'salary': 0.00,
'active': False}]
In [3]: result_proxy = connection.execute(stmt, values_list)
In [4]: print(result_proxy.rowcount)
Out[4]: 2
There are several ways to perform an insert with SQLAlchemy; however, we are going to focus on the one that follows the same pattern as the select statement.
It uses an insert statement where you specify the table as an argument, and supply the data you wish to insert via the .values() method as keyword arguments.
Here, the table is named test_data.
Instructions
# Use with local file
engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
test_data = Table('test_data', metadata, autoload=True, autoload_with=engine)
print('\nTest Data:')
test_data
# Build an insert statement to insert a record into the data table: stmt
stmt = insert(test_data).values(name='Anna', count=1, amount=1000.00, valid=True)
# Execute the statement via the connection: results
results = connection.execute(stmt)
# Print result rowcount
print(results.rowcount)
# Build a select statement to validate the insert
stmt = select([test_data]).where(test_data.columns.name == 'Anna')
# Print the result of executing the query.
print(connection.execute(stmt).first())
connection.close()
It's time to practice inserting multiple records at once!
As Jason showed you in the video, you'll want to first build a list of dictionaries that represents the data you want to insert. Then, in the .execute() method, you can pair this list of dictionaries with an insert statement, which will insert all the records in your list of dictionaries.
Instructions
engine = create_engine('sqlite:///data/intro_to_databases_in_python/test_data.sqlite')
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
test_data = Table('test_data', metadata, autoload=True, autoload_with=engine)
print('\nTest Data:')
test_data
# Build a list of dictionaries: values_list
values_list = [{'name': 'Alexandria', 'count': 1, 'amount': 1000.00, 'valid': True},
{'name': 'Taylor', 'count': 1, 'amount': 750.00, 'valid': False}]
# Build an insert statement for the data table: stmt
stmt = insert(test_data)
# Execute stmt with the values_list: results
results = connection.execute(stmt, values_list)
# Print rowcount
print(results.rowcount)
connection.close()
You've done a great job so far at inserting data into tables! You're now going to learn how to load the contents of a CSV file into a table.
We have used the csv module to set up a csv_reader, which is just a reader object that can iterate over the lines in a given CSV file - in this case, a census CSV file. Using the enumerate() function, you can loop over the csv_reader to handle the results one at a time. Here, for example, the first line it would return is:
0 ['Illinois', 'M', '0', '89600', '95012']
0 is the idx - or line number - while ['Illinois', 'M', '0', '89600', '95012'] is the row, corresponding to the column names 'state', 'sex', 'age', 'pop2000' and 'pop2008'. 'Illinois' can be accessed with row[0], 'M' with row[1], and so on. You can create a dictionary containing this information where the keys are the column names and the values are the entries in each line. Then, by appending this dictionary to a list, you can combine it with an insert statement to load it all into a table!
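As an aside (not the exercise solution), a sketch of the same load using pandas, assuming the CSV has no header row and that an engine connected to the target database is available as engine, could look like this:
# Hedged sketch: read the CSV with pandas and append it to the census table
census_df = pd.read_csv(census_csv_data, header=None,
                        names=['state', 'sex', 'age', 'pop2000', 'pop2008'])
census_df.to_sql('census', con=engine, index=False, if_exists='append')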
Instructions
csv_reader = list()
with open(census_csv_data, newline='\n') as csvfile:
    file_reader = csv.reader(csvfile)
    for row in file_reader:
        csv_reader.append(row)
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
print('\nCensus:')
census
# Create a insert statement for census: stmt
stmt = insert(census)
# Create an empty list and zeroed row count: values_list, total_rowcount
values_list = []
total_rowcount = 0
# Enumerate the rows of csv_reader
for idx, row in enumerate(csv_reader):
    # Create data and append to values_list
    data = {'state': row[0], 'sex': row[1], 'age': row[2], 'pop2000': row[3], 'pop2008': row[4]}
    values_list.append(data)
    # Check to see if divisible by 51
    if idx % 51 == 0:
        results = connection.execute(stmt, values_list)
        total_rowcount += results.rowcount
        values_list = []
# Print total rowcount
print(total_rowcount)
connection.close()
In [1]: from sqlalchemy import update
In [2]: stmt = update(employees)
In [3]: stmt = stmt.where(employees.columns.id == 3)
In [4]: stmt = stmt.values(active=True)
In [5]: result_proxy = connection.execute(stmt)
In [6]: print(result_proxy.rowcount)
Out[6]: 1
In [1]: stmt = update(employees)
In [2]: stmt = stmt.where(employees.columns.active == True)
In [3]: stmt = stmt.values(active=False, salary=0.00)
In [4]: result_proxy = connection.execute(stmt)
In [5]: print(result_proxy.rowcount)
Out[5]: 3
In [1]: new_salary = select([employees.columns.salary])
In [2]: new_salary = new_salary.order_by(desc(employees.columns.salary))
In [3]: new_salary = new_salary.limit(1)
In [4]: stmt = update(employees)
In [5]: stmt = stmt.values(salary=new_salary)
In [6]: result_proxy = connection.execute(stmt)
In [7]: print(result_proxy.rowcount)
Out[7]: 3
The update statement is very similar to an insert statement, except that it also typically uses a where clause to help us determine what data to update. You'll be using the FIPS state code here, which is used by the U.S. government to identify U.S. states and certain other associated areas. Recall that you can update all wages in the employees table as follows:
stmt = update(employees).values(wage=100.00)
For your convenience, the names of the tables and columns of interest in this exercise are: state_fact (Table), name (Column), and fips_state (Column).
Instructions
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
print('\nState Fact:')
state_fact
# Build a select statement: select_stmt
select_stmt = select([state_fact]).where(state_fact.columns.name == 'New York')
# Print the results of executing the select_stmt
print(connection.execute(select_stmt).fetchall())
# Build a statement to update the fips_state to 36: stmt
stmt = update(state_fact).values(fips_state = 36)
# Append a where clause to limit it to records for New York state
stmt = stmt.where(state_fact.columns.name == 'New York')
# Execute the statement: results
results = connection.execute(stmt)
# Print rowcount
print(results.rowcount)
# Execute the select_stmt again to view the changes
print(connection.execute(select_stmt).fetchall())
connection.close()
As Jason discussed in the video, by using a where clause that selects more records, you can update multiple records at once. It's time now to practice this!
For your convenience, the names of the tables and columns of interest in this exercise are: state_fact (Table), notes (Column), and census_region_name (Column).
Instructions
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
print('\nState Fact:')
state_fact
# Build a statement to update the notes to 'The Wild West': stmt
stmt = update(state_fact).values(notes = 'The Wild West')
# Append a where clause to match the West census region records
stmt = stmt.where(state_fact.columns.census_region_name == 'West')
# Execute the statement: results
results = connection.execute(stmt)
# Print rowcount
print(results.rowcount)
connection.close()
You can also update records with data from a select statement. This is called a correlated update. It works by defining a select statement that returns the value you want to update the record with and assigning that as the value in an update statement.
You'll be using a flat_census in this exercise as the target of your correlated update. The flat_census table is a summarized copy of your census table.
Instructions
metadata = MetaData()
# Define a new flat_census table with state_name and fips_code columns
flat_census = Table('flat_census', metadata,
Column('state_name', String(255)),
Column('fips_code', String(255)),
extend_existing=True)
engine = create_engine(census_sql_data)
connection = engine.connect()
# Use the metadata to create the table
metadata.create_all(engine)
# Print the table details
print(repr(metadata.tables['flat_census']))
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
results = connection.execute(select([state_fact.columns.fips_state])).fetchall()
results[:5]
len(results)
df = pd.DataFrame(results)
df.columns = ['fips_code']
df.head()
df.to_sql('flat_census', con=engine, index=False, if_exists='append')
# Build a statement to select name from state_fact: stmt
fips_stmt = select([state_fact.columns.name])
# Append a where clause to Match the fips_state to flat_census fips_code
fips_stmt = fips_stmt.where(state_fact.columns.fips_state == flat_census.columns.fips_code)
# Build an update statement to set the name to fips_stmt: update_stmt
update_stmt = update(flat_census).values(state_name = fips_stmt)
# Execute update_stmt: results
results = connection.execute(update_stmt)
# Print rowcount
print(results.rowcount)
connection.close()
In [1]: from sqlalchemy import delete
In [2]: stmt = select([func.count(extra_employees.columns.id)])
In [3]: connection.execute(stmt).scalar()
Out[3]: 3
In [4]: delete_stmt = delete(extra_employees)
In [5]: result_proxy = connection.execute(delete_stmt)
In [6]: result_proxy.rowcount
Out[6]: 3
In [1]: stmt = delete(employees).where(employees.columns.id == 3)
In [2]: result_proxy = connection.execute(stmt)
In [3]: result_proxy.rowcount
Out[3]: 1
In [1]: extra_employees.drop(engine)
In [2]: print(extra_employees.exists(engine))
Out[2]: False
In [1]: metadata.drop_all(engine)
In [2]: engine.table_names()
Out[2]: []
Often, you'll need to empty a table of all of its records so you can reload the data. You can do this with a delete statement with just the table as an argument. For example, in the video, Jason emptied the extra_employees table by executing the following:
delete_stmt = delete(extra_employees)
result_proxy = connection.execute(delete_stmt)
Do be careful, though, as deleting cannot be undone!
Instructions
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
results = connection.execute(select([census])).fetchall()
df = pd.DataFrame(results)
df.columns = ['state', 'sex', 'age', 'pop2000', 'pop2008']
_dtypes = {'state': String(30), 'sex': String(1), 'age': Integer(), 'pop2000': Integer(), 'pop2008': Integer()}
df.to_sql('census2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
census2 = Table('census2', metadata, autoload=True, autoload_with=engine)
# Build a statement to empty the census table: stmt
stmt = delete(census2)
# Execute the statement: results
results = connection.execute(stmt)
# Print affected rowcount
print(results.rowcount)
# Build a statement to select all records from the census table
stmt = select([census2])
# Print the results of executing the statement to verify there are no rows
print(connection.execute(stmt).fetchall())
connection.close()
By using a where() clause, you can target the delete statement to remove only certain records. For example, Jason deleted all rows from the employees table that had id 3 with the following delete statement:
delete(employees).where(employees.columns.id == 3)
Here you'll delete ALL rows which have 'M' in the sex column and 36 in the age column. We have included code at the start which computes the total number of these rows. It is important to make sure that this is the number of rows that you actually delete.
Instructions
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
results = connection.execute(select([census])).fetchall()
df = pd.DataFrame(results)
df.columns = ['state', 'sex', 'age', 'pop2000', 'pop2008']
_dtypes = {'state': String(30), 'sex': String(1), 'age': Integer(), 'pop2000': Integer(), 'pop2008': Integer()}
df.to_sql('census2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
census2 = Table('census2', metadata, autoload=True, autoload_with=engine)
# Build a statement to count records using the sex column for Men ('M') age 36: stmt
stmt = select([func.count(census2.columns.sex)]).where(and_(census2.columns.sex == 'M', census2.columns.age == 36))
# Execute the select statement and use the scalar() fetch method to save the record count
to_delete = connection.execute(stmt).scalar()
to_delete
# Build a statement to delete records from the census table: stmt_del
stmt_del = delete(census2)
# Append a where clause to target Men ('M') age 36
stmt_del = stmt_del.where(and_(census2.columns.sex == 'M', census2.columns.age == 36))
# Execute the statement: results
results = connection.execute(stmt_del)
# Print affected rowcount and to_delete record count, make sure they match
print(results.rowcount, to_delete)
You're now going to practice dropping individual tables from a database with the .drop() method, as well as all tables in a database with the .drop_all() method!
As Spider-Man's Uncle Ben (as well as Jason, in the video!) said: With great power, comes great responsibility. Do be careful when deleting tables, as it's not simple or fast to restore large databases! Remember, you can check to see if a table exists with the .exists() method.
This is the final exercise in this chapter: After this, you'll be ready to apply everything you've learned to a case study in the final chapter of this course!
Instructions
engine = create_engine(census_sql_data)
print('Engine Table Names: \n', engine.table_names())
connection = engine.connect()
metadata = MetaData()
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)
results = connection.execute(select([state_fact])).fetchall()
df = pd.DataFrame(results)
_dtypes = {'id': String(256),
'name': String(256),
'abbreviation': String(256),
'country': String(256),
'type': String(256),
'sort': String(256),
'status': String(256),
'occupied': String(256),
'notes': String(256),
'fips_state': String(256),
'assoc_press': String(256),
'standard_federal_region': String(256),
'census_region': String(256),
'census_region_name': String(256),
'census_division': String(256),
'census_division_name': String(256),
'circuit_court': String(256)}
df.columns = _dtypes.keys()
df.to_sql('state_fact2', con=engine, index=False, if_exists='replace', dtype=_dtypes)
state_fact2 = Table('state_fact2', metadata, autoload=True, autoload_with=engine)
# Drop the state_fact2 table
state_fact2.drop(engine)
# Check to see if state_fact2 still exists
print(state_fact2.exists(engine))
# Drop all remaining tables registered in the metadata
metadata.drop_all(engine)
# Confirm that state_fact2 is gone after dropping all tables
print(state_fact2.exists(engine))
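Note that Table.exists() is deprecated in newer SQLAlchemy releases, so a version-agnostic way to confirm what is left in the database is to ask an Inspector for the table names. A minimal sketch, assuming the same engine as above:
# Sketch: list the tables that remain after the drops
from sqlalchemy import inspect
print(inspect(engine).get_table_names())  # the dropped tables should no longer appear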
Here, you will bring together all of the skills you acquired in the previous chapters to work on a real-life project! From connecting to a database, to populating it, to reading and querying it, you will have a chance to apply all the key concepts you learned in this course.
In [1]: from sqlalchemy import create_engine, MetaData
In [2]: engine = create_engine('sqlite:///census_nyc.sqlite')
In [3]: metadata = MetaData()
In [4]: from sqlalchemy import (Table, Column, String, Integer, Numeric, Boolean)
In [5]: employees = Table('employees', metadata,
Column('id', Integer()),
Column('name', String(255)),
Column('salary', Numeric()),
Column('active', Boolean()))
In [6]: metadata.create_all(engine)
In this exercise, your job is to create an engine to the database that will be used in this chapter. Then, you need to initialize its metadata.
Recall how you did this in Chapter 1 by leveraging create_engine() and MetaData.
Instructions
# Import create_engine, MetaData
# from sqlalchemy import create_engine, MetaData -> done at top of notebook
# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///chapter5.sqlite')
# Initialize MetaData: metadata
metadata = MetaData()
Having set up the engine and initialized the metadata, you will now define the census table object and then create it in the database using the metadata and engine from the previous exercise. To create it in the database, you will have to use the .create_all() method on the metadata with engine as the argument.
It may help to refer back to the Chapter 4 exercise in which you learned how to create a table.
Instructions
# Import Table, Column, String, and Integer
# from sqlalchemy import Table, Column, String, Integer -> done at top of notebook
# Build a census table: census
census = Table('census', metadata,
Column('state', String(30)),
Column('sex', String(1)),
Column('age', Integer()),
Column('pop2000', Integer()),
Column('pop2008', Integer()))
# Create the table in the database
metadata.create_all(engine)
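To double-check that the table was actually created, you can print the Table object held by the metadata and list the tables the engine now sees (a quick verification sketch using the objects defined above):
# Verification sketch: confirm the census table exists in the metadata and in the database
print(repr(metadata.tables['census']))
print(engine.table_names())  # 'census' should appear in this list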
In [7]: values_list = []
In [8]: for row in csv_reader:
data = {'state': row[0], 'sex': row[1],
'age': row[2], 'pop2000': row[3],
'pop2008': row[4]}
values_list.append(data)
In [9]: from sqlalchemy import insert
In [10]: stmt = insert(employees)
In [11]: result_proxy = connection.execute(stmt, values_list)
In [12]: print(result_proxy.rowcount)
Out[12]: 2
Leverage the Python CSV module from the standard library and load the data into a list of dictionaries.
It may help to refer back to the Chapter 4 exercise in which you did something similar.
Instructions
# Create an empty list: values_list
values_list = list()
with open(census_csv_data, newline='') as csvfile:
csv_reader = csv.reader(csvfile)
# Iterate over the rows
for row in csv_reader:
# Create a dictionary with the values
data = {'state': row[0],
'sex': row[1],
'age':row[2],
'pop2000': row[3],
'pop2008': row[4]}
# Append the dictionary to the values list
values_list.append(data)
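If the CSV file had a header row, the same list of dictionaries could be built more directly with csv.DictReader; census.csv has no header here, so the sketch below supplies the fieldnames explicitly (an assumption about the column order, matching the dictionary keys above).
# Alternative sketch: build the same list of dicts with csv.DictReader
with open(census_csv_data, newline='') as csvfile:
    dict_reader = csv.DictReader(
        csvfile, fieldnames=['state', 'sex', 'age', 'pop2000', 'pop2008'])
    values_list_alt = [dict(row) for row in dict_reader]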
In this exercise, you will use the multiple insert pattern to load the data from values_list into the table.
Instructions
connection = engine.connect()
# Import insert
# from sqlalchemy import insert -> done at top of notebook
# Build insert statement: stmt
stmt = insert(census)
# Use values_list to insert data: results
results = connection.execute(stmt, values_list)
# Print rowcount
results.rowcount
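For a file this size a single executemany call is fine, but if the list were much larger you could insert it in smaller batches. A minimal sketch of that pattern, with a hypothetical chunk size of 10,000 rows:
# Sketch: insert values_list in chunks to keep each statement small
chunk_size = 10000  # hypothetical batch size
total = 0
for start in range(0, len(values_list), chunk_size):
    batch = values_list[start:start + chunk_size]
    total += connection.execute(insert(census), batch).rowcount  # rowcount per batch (driver-dependent)
print(total)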
In [13]: from sqlalchemy import select
In [14]: stmt = select([census.columns.sex,
(func.sum(census.columns.pop2008 *
census.columns.age) /
func.sum(census.columns.pop2008)
).label('average_age')])
In [15]: stmt = stmt.group_by(census.columns.sex)
In [16]: results = connection.execute(stmt).fetchall()
In [17]: from sqlalchemy import case, cast, Float
In [18]: stmt = select([(func.sum(
             case([(census.columns.state == 'New York', census.columns.pop2008)],
                  else_=0)) /
             cast(func.sum(census.columns.pop2008), Float) * 100
             ).label('ny_percent')])
In [19]: stmt = select([census.columns.age,
(census.columns.pop2008-
census.columns.pop2000).label('pop_change')])
In [20]: stmt = stmt.order_by('pop_change')
In [21]: stmt = stmt.limit(5)
In this exercise, you will use the func.sum() and group_by() methods to first determine the average age weighted by the population in 2008, and then group by sex.
As Jason discussed in the video, a weighted average is calculated as the sum of the products of the weights and the values, divided by the sum of the weights.
For example, the following statement determines the average age weighted by the population in 2000:
stmt = select([census.columns.sex,
(func.sum(census.columns.pop2000 * census.columns.age) /
func.sum(census.columns.pop2000)).label('average_age')
])
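To make the arithmetic concrete, here is a tiny pure-Python check of the same formula on made-up numbers (ages 30 and 50 with populations 100 and 300; illustrative values only):
# Sketch: weighted average = sum(weight * value) / sum(weight)
ages = [30, 50]          # hypothetical ages
pops = [100, 300]        # hypothetical population weights
weighted_avg = sum(p * a for p, a in zip(pops, ages)) / sum(pops)
print(weighted_avg)      # 45.0 -> pulled toward the age with the larger population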
Instructions
# Import select
# from sqlalchemy import select -> done at top of notebook
# Calculate weighted average age: stmt
stmt = select([census.columns.sex,
(func.sum(census.columns.pop2008 * census.columns.age) /
func.sum(census.columns.pop2008)).label('average_age')
])
# Group by sex
stmt = stmt.group_by(census.columns.sex)
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
results
# Print the average age by sex
for result in results:
print(result[0], result[1])
In this exercise, you will write a query to determine the percentage of the population in 2000 that was made up of women. You will group this query by state.
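Why the cast() to Float? In SQLite (and some other backends) dividing two integer sums performs integer division, so the ratio would silently truncate; casting one operand to Float forces a true floating-point percentage. A tiny illustration of the same pitfall in plain Python integer division (hypothetical counts, for illustration only):
# Sketch: why integer division would hide the percentage
females, total = 983, 2000        # hypothetical counts
print(females // total * 100)     # 0     -> integer division truncates before scaling
print(females / total * 100)      # 49.15 -> floating-point keeps the fraction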
Instructions
# import case, cast and Float from sqlalchemy
# from sqlalchemy import case, cast, Float -> done at top of notebook
# Build a query to calculate the percentage of females in 2000: stmt
stmt = select([census.columns.state,
(func.sum(
case([
(census.columns.sex == 'F', census.columns.pop2000)
], else_=0)) /
cast(func.sum(census.columns.pop2000), Float) * 100).label('percent_female')
])
# Group By state
stmt = stmt.group_by(census.columns.state)
# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()
# Print the percentage
for result in results:
print(f'State: {result.state}, Percent Female: {result.percent_female}')
results_df = pd.DataFrame(results, columns=['State', 'Percent_Female'])
results_df
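As an optional follow-up, you could sort the DataFrame to see which states have the highest share of women (just a pandas convenience on the results already fetched):
# Sketch: states with the highest percentage of women in 2000
print(results_df.sort_values('Percent_Female', ascending=False).head())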
In this final exercise, you will write a query to determine which states changed the most in population between 2000 and 2008. You will limit your query to display only the top 10 states.
Instructions
# Build query to return state name and population difference from 2008 to 2000
stmt = select([census.columns.state,
(census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
# Group by State
stmt = stmt.group_by(census.columns.state)
# Order by Population Change
stmt = stmt.order_by(desc('pop_change'))
# Limit to top 10
stmt = stmt.limit(10)
# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()
# Print the state and population change for each record
for result in results:
print('{}:{}'.format(result.state, result.pop_change))
pop_change_df = pd.DataFrame(results, columns=['State', 'Population_Change'])
pop_change_df
connection.close()
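Since matplotlib is already imported at the top of the notebook, a quick bar chart of the ten largest population changes is a natural way to close out the case study (a plotting sketch using the pop_change_df built above):
# Sketch: visualise the top-10 population changes
pop_change_df.plot(kind='bar', x='State', y='Population_Change', legend=False)
plt.ylabel('Population change, 2000 to 2008')
plt.tight_layout()
plt.show()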