Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path
Pandas Configuration Options
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
Data Files Location
Data File Objects
data = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'
auto_fuel_file = data / 'auto_fuel_efficiency.csv'
baby_1881_file = data / 'baby_names1881.csv'
baby_1981_file = data / 'baby_names1981.csv'
exch_rates_file = data / 'exchange_rates.csv'
gdp_china_file = data / 'gdp_china.csv'
gdp_usa_file = data / 'gdp_usa.csv'
oil_price_file = data / 'oil_price.csv'
pitts_file = data / 'pittsburgh_weather_2013.csv'
sales_feb_hardware_file = data / 'sales-feb-Hardware.csv'
sales_feb_service_file = data / 'sales-feb-Service.csv'
sales_feb_software_file = data / 'sales-feb-Software.csv'
sales_jan_2015_file = data / 'sales-jan-2015.csv'
sales_feb_2015_file = data / 'sales-feb-2015.csv'
sales_mar_2015_file = data / 'sales-mar-2015.csv'
sp500_file = data / 'sp500.csv'
so_bronze_file = data / 'summer_olympics_Bronze.csv'
so_bronze5_file = data / 'summer_olympics_bronze_top5.csv'
so_gold_file = data / 'summer_olympics_Gold.csv'
so_gold5_file = data / 'summer_olympics_gold_top5.csv'
so_silver_file = data / 'summer_olympics_Silver.csv'
so_silver5_file = data / 'summer_olympics_silver_top5.csv'
so_all_medalists_file = data / 'summer_olympics_medalists 1896 to 2008 - ALL MEDALISTS.tsv'
so_editions_file = data / 'summer_olympics_medalists 1896 to 2008 - EDITIONS.tsv'
so_ioc_codes_file = data / 'summer_olympics_medalists 1896 to 2008 - IOC COUNTRY CODES.csv'
Course Description
As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.
In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course.
import pandas as pd
dataframe0 = pd.read_csv('sales-jan-2015.csv')
dataframe1 = pd.read_csv('sales-feb-2015.csv')
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
dataframes.append(pd.read_csv(f))
filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
When data is spread among several files, you usually invoke pandas' read_csv()
(or a similar data import function) multiple times to load the data into several DataFrames.
The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the Guardian.
The column labels of each DataFrame are NOC
, Country
, & Total
where NOC
is a three-letter code for the name of the country and Total
is the number of medals of that type won (bronze, silver, or gold).
Instructions
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(so_bronze_file)
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(so_silver_file)
# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(so_gold_file)
# Print the first five rows of gold
gold.head()
As you saw in the video, loading data from multiple files into DataFrames is more efficient in a loop or a list comprehension.
Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.
Here, you'll continue working with The Guardian's Olympic medal dataset.
Instructions
# Create the list of file names: filenames
filenames = [so_bronze_file, so_silver_file, so_gold_file]
# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
dataframes.append(pd.read_csv(filename))
# Print top 5 rows of 1st DataFrame in dataframes
dataframes[0].head()
In this exercise, you'll combine the three DataFrames from earlier exercises - gold
, silver
, & bronze
- into a single DataFrame called medals
. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.
Remember, the column labels of each DataFrame are NOC
, Country
, and Total
, where NOC
is a three-letter code for the name of the country and Total
is the number of medals of that type won.
Instructions
gold
called medals
using the .copy()
method.new_labels
with entries 'NOC'
, 'Country'
, & 'Gold'
. This is the same as the column labels from gold
with the column label 'Total'
replaced by 'Gold'
.medals
by assigning new_labels
to medals.columns
.'Silver'
and 'Bronze'
in medals using silver['Total']
& bronze['Total']
.medals
. This has been done for you, so hit 'Submit Answer' to see the result!# Make a copy of gold: medals
medals = gold.copy()
# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']
# Rename the columns of medals using new_labels
medals.columns = new_labels
# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']
# Print the head of medals
medals.head()
del bronze, silver, gold, dataframes, medals
import pandas as pd
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
w_max = pd.read_csv('quarterly_max_temp.csv', index_col='Month')
print(w_mean)
Mean TemperatureF
Month
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783
print(w_max)
Max TemperatureF
Month
Jan 68
Apr 89
Jul 91
Oct 84
print(w_mean.index)
Index(['Apr', 'Jan', 'Jul', 'Oct'], dtype='object', name='Month')
print(w_max.index)
Index(['Jan', 'Apr', 'Jul', 'Oct'], dtype='object', name='Month')
print(type(w_mean.index))
<class 'pandas.indexes.base.Index'>
ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
w_mean2.sort_index()
Mean TemperatureF
Month
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783
w_mean.reindex(w_max.index)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Dec'])
print(w_mean3)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Dec NaN
w_max.reindex(w_mean3.index)
Max TemperatureF
Month
Jan 68.0
Apr 89.0
Dec NaN
w_max.reindex(w_mean3.index).dropna()
Max TemperatureF
Month
Jan 68.0
Apr 89.0
w_max.reindex(w_mean.index)
Max TemperatureF
Month
Apr 89
Jan 68
Jul 91
Oct 84
w_mean.reindex(w_max.index)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are .sort_index()
and .sort_values()
.
In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.
Instructions
'monthly_max_temp.csv'
into a DataFrame called weather1
with 'Month'
as the index.weather1
in alphabetical order using the .sort_index()
method and store the result in weather2
.weather1
in reverse alphabetical order by specifying the additional keyword argument ascending=False
inside .sort_index()
..sort_values()
method to sort weather1
in increasing numerical order according to the values of the column 'Max TemperatureF'
.monthly_max_temp = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
'Max TemperatureF': [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]}
# Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)
# Print the head of weather1
print(weather1.head())
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()
# Print the head of weather2
print(weather2.head())
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)
# Print the head of weather3
print(weather3.head())
# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values(by='Max TemperatureF')
# Print the head of weather4
print(weather4.head())
Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex()
method.
In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).
The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr
, Jan
, Jul
, and Oct
. This data has been loaded into a DataFrame called weather1
and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.
You'll initially use a list of all twelve month abbreviations and subsequently apply the .ffill()
method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year
.
Instructions
weather1
using the .reindex()
method with the list year
as the argument, which contains the abbreviations for each month.weather1
just as you did above, this time chaining the .ffill()
method to replace the null values with the last preceding non-null value.monthly_max_temp = {'Month': ['Jan', 'Apr', 'Jul', 'Oct'],
'Max TemperatureF': [32.13333, 61.956044, 68.934783, 43.434783]}
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)
weather1
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)
# Print weather2
weather2
# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()
# Print weather3
weather3
Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex()
method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index
attribute.
The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981
and names_1881
loaded for you.
The DataFrames names_1981
and names_1881
both have a MultiIndex with levels name
and gender
giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981
and names_1881
were read in using the following commands:
names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.
Your job here is to use the DataFrame .reindex()
and .dropna()
methods to make a DataFrame common_names
counting names from 1881 that were still popular in 1981.
Instructions
common_names
by reindexing names_1981
using the Index of the DataFrame names_1881
of older names.common_names
DataFrame. This has been done for you. It should be the same as that of names_1881
.common_names
that have null counts using the .dropna()
method. These rows correspond to names that fell out of fashion between 1881 & 1981.common_names
DataFrame. This has been done for you, so hit 'Submit Answer' to see the result!names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'], index_col=(0,1))
names_1981.head()
names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name','gender','count'], index_col=(0,1))
names_1881.head()
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)
# Print shape of common_names
common_names.shape
# Drop rows with null counts: common_names
common_names = common_names.dropna()
# Print shape of new common_names
common_names.shape
common_names.head(10)
del weather1, weather2, weather3, weather4, common_names, names_1881, names_1981
import pandas as pd
weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'PrecipitationIn']
Date
2013-07-01 0.18
2013-07-02 0.14
2013-07-03 0.00
2013-07-04 0.25
2013-07-05 0.02
2013-07-06 0.06
2013-07-07 0.10
Name: PrecipitationIn, dtype: float64
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54
Date
2013-07-01 0.4572
2013-07-02 0.3556
2013-07-03 0.0000
2013-07-04 0.6350
2013-07-05 0.0508
2013-07-06 0.1524
2013-07-07 0.2540
Name: PrecipitationIn, dtype: float64
week1_range = weather.loc['2013-07-01':'2013-07-07', ['Min TemperatureF', 'Max TemperatureF']]
print(week1_range)
Min TemperatureF Max TemperatureF
Date
2013-07-01 66 79
2013-07-02 66 84
2013-07-03 71 86
2013-07-04 70 86
2013-07-05 69 86
2013-07-06 70 89
2013-07-07 70 77
week1_mean = weather.loc['2013-07-01':'2013-07-07', 'Mean TemperatureF']
print(week1_mean)
Date
2013-07-01 72
2013-07-02 74
2013-07-03 78
2013-07-04 77
2013-07-05 76
2013-07-06 78
2013-07-07 72
Name: Mean TemperatureF, dtype: int64
week1_range / week1_mean
RuntimeWarning: Cannot compare type 'Timestamp' with type 'str', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
2013-07-01 00:00:00 2013-07-02 00:00:00 2013-07-03 00:00:00 \
Date
2013-07-01 NaN NaN NaN
2013-07-02 NaN NaN NaN
2013-07-03 NaN NaN NaN
2013-07-04 NaN NaN NaN
2013-07-05 NaN NaN NaN
2013-07-06 NaN NaN NaN
2013-07-07 NaN NaN NaN
2013-07-04 00:00:00 2013-07-05 00:00:00 2013-07-06 00:00:00 \
Date
2013-07-01 NaN NaN NaN
... ...
week1_range.divide(week1_mean, axis='rows')
Min TemperatureF Max TemperatureF
Date
2013-07-01 0.916667 1.097222
2013-07-02 0.891892 1.135135
2013-07-03 0.910256 1.102564
2013-07-04 0.909091 1.116883
2013-07-05 0.907895 1.131579
2013-07-06 0.897436 1.141026
2013-07-07 0.972222 1.069444
week1_mean.pct_change() * 100
Date
2013-07-01 NaN
2013-07-02 2.777778
2013-07-03 5.405405
2013-07-04 -1.282051
2013-07-05 -1.298701
2013-07-06 2.631579
2013-07-07 -7.692308
Name: Mean TemperatureF, dtype: float64
bronze = pd.read_csv('bronze_top5.csv', index_col=0)
print(bronze)
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
silver = pd.read_csv('silver_top5.csv', index_col=0)
print(silver)
Total
Country
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
gold = pd.read_csv('gold_top5.csv', index_col=0)
print(gold)
Total
Country
United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0
bronze + silver
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
bronze + silver
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
In [22]: print(bronze['United States'])
1052.0
In [23]: print(silver['United States'])
1195.0
bronze.add(silver)
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
bronze.add(silver, fill_value=0)
Country
France 936.0
Germany 454.0
Italy 394.0
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
bronze + silver + gold
Country
France NaN
Germany NaN
Italy NaN
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
bronze.add(silver, fill_value=0).add(gold, fill_value=0)
Country
France 936.0
Germany 861.0
Italy 854.0
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
The DataFrames january
and february
, which have been printed in the IPython Shell, represent the sales a company made in the corresponding months.
The Indexes in both DataFrames are called Company
, identifying which company bought that quantity of units. The column Units
is the number of units sold.
If you were to add these two DataFrames
by executing the command total = january + february
, how many rows would the resulting DataFrame have? Try this in the IPython Shell and find out for yourself.
jan_dict = {'Company': ['Acme Corporation', 'Hooli', 'Initech', 'Mediacore', 'Streeplex'],
'Units': [19, 17, 20, 10, 13]}
feb_dict = {'Company': ['Acme Corporation', 'Hooli', 'Mediacore', 'Vandelay Inc'],
'Units': [15, 3, 12, 25]}
january = pd.DataFrame.from_dict(jan_dict)
january.set_index('Company', inplace=True)
print(january)
february = pd.DataFrame.from_dict(feb_dict)
february.set_index('Company', inplace=True)
print('\n', february, '\n')
print(january + february)
In this exercise, you'll work with weather data pulled from wunderground.com. The DataFrame weather
has been pre-loaded along with pandas as pd
. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.
You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.
Remember, ordinary arithmetic operators (like +
, -
, *
, and /
) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.
Instructions
temps_f
by extracting the columns 'Min TemperatureF'
, 'Mean TemperatureF'
, & 'Max TemperatureF'
from weather
as a new DataFrame temps_f
. To do this, pass the relevant columns as a list to weather[]
.temps_c
from temps_f
using the formula (temps_f - 32) * 5/9
.temps_c
to replace 'F'
with 'C'
using the .str.replace('F', 'C')
method on temps_c.columns
.temps_c
. This has been done for you, so hit 'Submit Answer' to see the result!weather = pd.read_csv(pitts_file)
weather.set_index('Date', inplace=True)
weather.head(3)
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9
# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')
# Print first 5 rows of temps_c
temps_c.head()
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.
The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv
, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from pandas Foundations.
Instructions
'GDP.csv'
into a DataFrame called gdp
.parse_dates=True
and index_col='DATE'
.post2008
by slicing gdp
such that it comprises all rows from 2008 onward.post2008
. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.yearly
by resampling the slice post2008
by year. Remember, you need to chain .resample()
(using the alias 'A'
for annual frequency) with some kind of aggregation; you will use the aggregation method .last()
to select the last element when resampling.yearly
with .pct_change() * 100
.# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(gdp_usa_file, parse_dates=True, index_col='DATE')
gdp.head()
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':]
# Print the last 8 rows of post2008
post2008.tail(8)
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
# Print yearly
yearly
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100
# Print yearly again
yearly
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv
for sp500 and exchange.csv
for the exchange rates are both provided to you.
Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.
Instructions
sp500
& exchange
from the files 'sp500.csv'
& 'exchange.csv'
respectively..parse_dates=True
and index_col='Date'
.'Open'
& 'Close'
from the DataFrame sp500
as a new DataFrame dollars
and print the first 5 rows.pounds
by converting US dollars to British pounds. You'll use the .multiply()
method of dollars
with exchange['GBP/USD']
and axis='rows'
pounds
. This has been done for you, so hit 'Submit Answer' to see the results!.# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv(sp500_file, parse_dates=True, index_col='Date')
sp500.head()
# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv(exch_rates_file, parse_dates=True, index_col='Date')
exchange.head()
# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]
# Print the head of dollars
dollars.head()
# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')
# Print the head of pounds
pounds.head()
del january, february, feb_dict, jan_dict, weather, temps_f, temps_c, gdp, post2008, yearly, sp500, exchange, dollars, pounds
Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.
import pandas as pd
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND', 'SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT',
east = northeast.append(south)
print(east)
0 CT 7 DC
1 ME 8 WV
2 MA 9 AL
3 NH 10 KY
4 RI 11 MS
5 VT 12 TN
6 NJ 13 AR
7 NY 14 LA
8 PA 15 OK
0 DE 16 TX
1 FL dtype: object
2 GA
3 MD
4 NC
5 SC
6 VA
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')
print(east.loc[3])
3 NH
3 MD
dtype: object
new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
east = pd.concat([northeast, south])
print(east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
0 DE
1 FL
dtype: object
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')
new_east = pd.concat([northeast, south], ignore_index=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
The Series bronze
and silver
, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called Country
and the values are the corresponding number of medals won.
If you were to run the command combined = bronze.append(silver)
, how many rows would combined
have? And how many rows would combined.loc['United States']
return? Find out for yourself by running these commands in the IPython Shell.
Instructions
Possible Answers
bronze = pd.read_csv(so_bronze5_file, index_col=0)
bronze
silver = pd.read_csv(so_silver5_file, index_col=0)
silver
combined = bronze.append(silver)
combined
combined.loc['United States']
In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units'
column from each and append them together with method chaining using .append()
.
To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.
Instructions
'sales-jan-2015.csv'
, 'sales-feb-2015.csv'
and 'sales-mar-2015.csv'
into the DataFrames jan
, feb
, and mar
respectively.parse_dates=True
and index_col='Date'
.'Units'
column of jan
, feb
, and mar
to create the Series jan_units
, feb_units
, and mar_units
respectively.quarter1
by appending feb_units
to jan_units
and then appending mar_units
to the result. Use chained calls to the .append()
method to do this.quarter1
has the individual Series stacked vertically. To do this:jan 27, 2015
to feb 2, 2015
.feb 26, 2015
to mar 7, 2015
.quarter1
. This has been done for you, so hit 'Submit Answer' to see the result!# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(sales_jan_2015_file, parse_dates=True, index_col='Date')
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(sales_feb_2015_file, parse_dates=True, index_col='Date')
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(sales_mar_2015_file, parse_dates=True, index_col='Date')
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']
# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']
# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']
# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)
# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
# Compute & print total sales in quarter1
print(quarter1.sum())
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan
, feb
, andmar
have been pre-loaded.
Your job is to use pd.concat()
with a list of Series to achieve the same result that you would get by chaining calls to .append()
.
You may be wondering about the difference between pd.concat()
and pandas' .append()
method. One way to think of the difference is that .append()
is a specific case of a concatenation, while pd.concat()
gives you more flexibility, as you'll see in later exercises.
Instructions
units
. This has been done for you.for
loop to iterate over [jan, feb, mar]
:'Units'
column of each DataFrame to units
.units
into a longer Series called quarter1
using pd.concat()
.axis='rows'
to stack the Series vertically.quarter1
has the individual Series stacked vertically by printing slices. This has been done for you, so hit 'Submit Answer' to see the result!# Initialize empty list: units
units = []
# Build the list of Series
for month in [jan, feb, mar]:
units.append(month.Units)
# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')
# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
del bronze, silver, combined, jan, feb, mar, jan_units, feb_units, mar_units, quarter1
In [1]: import pandas as pd
In [2]: pop1 = pd.read_csv('population_01.csv', index_col=0)
In [3]: pop2 = pd.read_csv('population_02.csv', index_col=0)
pop1_data = {'Zip Code ZCTA': [66407, 72732, 50579, 46421], '2010 Census Population': [479, 4716, 2405, 30670]}
pop2_data = {'Zip Code ZCTA': [12776, 76092, 98360, 49464], '2010 Census Population': [2180, 26669, 12221, 27481]}
pop1 = pd.DataFrame.from_dict(pop1_data)
pop1.set_index('Zip Code ZCTA', drop=True, inplace=True)
pop2 = pd.DataFrame.from_dict(pop2_data)
pop2.set_index('Zip Code ZCTA', drop=True, inplace=True)
pop1
pop2
print(type(pop1), pop1.shape)
print(type(pop2), pop2.shape)
pop1.append(pop2)
print(pop1.index.name, pop1.columns)
print(pop2.index.name, pop2.columns)
population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
pop_data = {'Zip Code ZCTA': [57538, 59916, 37660, 2860], '2010 Census Population': [322, 130, 40038, 45199]}
emp_data = {'Zip': [2860, 46167, 1097, 80808], 'unemployment': [0.11, 0.02, 0.33, 0.07], 'participants': [34447, 4800, 42, 4310]}
population = pd.DataFrame.from_dict(pop_data)
population.set_index('Zip Code ZCTA', drop=True, inplace=True)
unemployment = pd.DataFrame.from_dict(emp_data)
unemployment.set_index('Zip', drop=True, inplace=True)
population
unemployment
population.append(unemployment, sort=True)
population.append(unemployment, sort=True)
axis=0
, pd.concat
is the same as population.append(unemployment, sort=True)
pd.concat([population, unemployment], axis=0, sort=True)
pd.concat([population, unemployment], axis=1, sort=True)
del pop1_data, pop2_data, pop1, pop2, pop_data, emp_data, population, unemployment
In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981
and names_1881
are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).
You'll use the DataFrame .append()
method to make a DataFrame combined_names
. To distinguish rows from the original two DataFrames, you'll add a 'year'
column to each with the year (1881 or 1981 in this case). In addition, you'll specify ignore_index=True
so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1
, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
Instructions
'year'
column in the DataFrames names_1881
and names_1981
, with values of 1881
and 1981
respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.combined_names
by appending the rows of names_1981
underneath the rows of names_1881
. Specify the keyword argument ignore_index=True
to make a new RangeIndex of unique integers for each row.combined_names
that have the name 'Morgan'
. To do this, use the .loc[]
accessor with an appropriate filter. The relevant column of combined_names
here is 'name'
.names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name', 'gender', 'count'])
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'])
names_1981.head()
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981
# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True, sort=False)
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)
# Print all rows that contain the name 'Morgan'
combined_names[combined_names.name == 'Morgan']
The function pd.concat()
can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1
oraxis='columns'
.
In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).
The files 'quarterly_max_temp.csv'
and 'monthly_mean_temp.csv'
have been pre-loaded into the DataFrames weather_max
and weather_mean
respectively, and pandas
has been imported as pd
.
Instructions
weather
by concatenating the DataFrames weather_max
and weather_mean
horizontally.pd.concat()
as a list and specify the keyword argument axis=1
to stack them horizontally.weather
.weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}
weather_mean = pd.DataFrame.from_dict(weather_mean_data)
weather_mean.set_index('Month', inplace=True, drop=True)
weather_max = pd.DataFrame.from_dict(weather_max_data)
weather_max.set_index('Month', inplace=True, drop=True)
weather_max
weather_mean
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1, sort=True)
# Print weather
weather
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.
Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.
pandas
has been imported as pd
and two lists have been pre-loaded: An empty list called medals
, and medal_types
, which contains the strings 'bronze'
, 'silver'
, and 'gold'
.
Instructions
medal_types
in the for loop
.for
loop:file_name
using string interpolation with the loop variable medal
. This has been done for you. The expression "%s_top5.csv" % medal
evaluates as a string with the value of medal
replacing %s
in the format string.columns
. This has been done for you.file_name
into a DataFrame called medal_df
. Specify the keyword arguments header=0
, index_col='Country'
, and names=columns
to get the correct row and column Indexes.medal_df
to medals
using the list .append()
method.medals
horizontally (using axis='columns'
) to create a single DataFrame called medals
. Print it in its entirety.top_five = data.glob('*_top5.csv')
for file in top_five:
print(file)
medal_types = ['bronze', 'silver', 'gold']
medal_list = list()
for medal in medal_types:
# Create the file name: file_name
file_name = data / f'summer_olympics_{medal}_top5.csv'
# Create list of column names: columns
columns = ['Country', medal]
# Read file_name into a DataFrame: df
medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)
# Append medal_df to medals
medal_list.append(medal_df)
# Concatenate medals horizontally: medals
medals = pd.concat(medal_list, axis='columns', sort=True)
# Print medals
medals
del names_1881, names_1981, combined_names, weather_mean_data, weather_max_data, weather_mean, weather_max, weather, top_five, medals, medal_list
import pandas as pd
file1 = 'q1_rainfall_2013.csv'
rain2013 = pd.read_csv(file1, index_col='Month', parse_dates=True)
file2 = 'q1_rainfall_2014.csv'
rain2014 = pd.read_csv(file2, index_col='Month', parse_dates=True)
rain_2013_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.096129, 0.067143, 0.061613]}
rain_2014_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.050323, 0.082143, 0.070968]}
rain2013 = pd.DataFrame.from_dict(rain_2013_data)
rain2013.set_index('Month', inplace=True)
rain2014 = pd.DataFrame.from_dict(rain_2014_data)
rain2014.set_index('Month', inplace=True)
rain2013
rain2014
pd.concat([rain2013, rain2014], axis=0)
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
rain1314
rain1314.loc[2014]
rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
rain1314
rain1314[2013]
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis='columns')
rain1314
del rain_2013_data, rain_2014_data, rain2013, rain2014, rain1314
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys
parameter in the call to pd.concat()
, which generates a hierarchical index with the labels from keys
as the outermost index label. So you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, pandas
has been imported as pd
and two lists have been pre-loaded: An empty list called medals
, and medal_types
, which contains the strings 'bronze'
, 'silver'
, and 'gold'
.
Instructions
for
loop:file_name
into a DataFrame called medal_df
. Specify the index to be 'Country'
.medal_df
to medals
.medals
into a single DataFrame called medals
. Be sure to use the keyword argument keys=['bronze', 'silver', 'gold']
to create a vertically stacked DataFrame with a MultiIndex.medals
. This has been done for you, so hit 'Submit Answer' to see the result!medal_types = ['bronze', 'silver', 'gold']
medal_list = list()
for medal in medal_types:
# Create the file name: file_name
file_name = data / f'summer_olympics_{medal}_top5.csv'
# Read file_name into a DataFrame: medal_df
medal_df = pd.read_csv(file_name, index_col='Country')
# Append medal_df to medals
medal_list.append(medal_df)
# Concatenate medals: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'])
# Print medals in entirety
print(medals)
This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).
You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use the pd.IndexSlice
to extract specific slices. Check out this exercise from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.
pandas
has been imported for you as pd
and the DataFrame medals
is already in your namespace.
Instructions
medals_sorted
with the entries of medals
sorted. Use .sort_index(level=0)
to ensure the Index is sorted suitably.pd.IndexSlice
called idx
. A slicer pd.IndexSlice
is required when slicing on the inner level of a MultiIndex..loc[]
accessor with idx[:,'United Kingdom'], :
.# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)
# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])
# Print data about silver medals
print(medals_sorted.loc['silver'])
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice
# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:, 'United Kingdom'], :]
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called dataframes
. All three DataFrames contain 'Company'
, 'Product'
, and 'Units'
columns with a 'Date'
column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes Hardware
transactions, the second describes Software
transactions, and the third, Service
transactions.
Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.
Instructions
february
with MultiIndexed columns by concatenating the list dataframes
.axis=1
to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service']
to construct a hierarchical Index from each DataFrame.february
using the .info()
method. This has been done for you.idx
for pd.IndexSlice
.slice_2_8
from february
(using .loc[]
& idx
) that comprises rows between Feb. 2, 2015 to Feb. 8, 2015 from columns under 'Company'
.slice_2_8
. This has been done for you, so hit 'Submit Answer' to see the sliced data!hw = pd.read_csv(sales_feb_hardware_file, index_col='Date')
sw = pd.read_csv(sales_feb_software_file, index_col='Date')
sv = pd.read_csv(sales_feb_service_file, index_col='Date')
dataframes = [hw, sw, sv]
dataframes
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'], sort=True)
# Print february.info()
february.info()
february
# Assign pd.IndexSlice: idx
idx = pd.IndexSlice
# Create the slice: slice_2_8
slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]
# Print slice_2_8
slice_2_8
You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan
, feb
, and mar
have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company'
column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.
Instructions
month_list
consisting of the tuples ('january', jan)
, ('february', feb)
, and ('march', mar)
.month_dict
.for
loop:month_data
by 'Company'
and use .sum()
to aggregate.sales
by concatenating the DataFrames stored in month_dict
.pd.IndexSlice
and print all sales by 'Mediacore'
. This has been done for you, so hit 'Submit Answer' to see the result!jan = pd.read_csv(sales_jan_2015_file)
feb = pd.read_csv(sales_feb_2015_file)
mar = pd.read_csv(sales_mar_2015_file)
mar
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]
# Create an empty dictionary: month_dict
month_dict = dict()
for month_name, month_data in month_list:
# Group month_data: month_dict[month_name]
month_dict[month_name] = month_data.groupby(['Company']).sum()
# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)
# Print sales
print(sales)
# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])
del medal_types, medal_list, medal_df, medals, medals_sorted, idx, hw, sw, sv, dataframes, february, slice_2_8
A = np.arange(8).reshape(2, 4) + 0.1
A
B = np.arange(6).reshape(2,3) + 0.2
B
C = np.arange(12).reshape(3,4) + 0.3
C
np.hstack([B, A])
np.concatenate([B, A], axis=1)
np.vstack([A, C])
np.concatenate([A, C], axis=0)
np.concatenate([A, B], axis=0) # incompatible columns
np.concatenate([A, C], axis=1) # incompatible rows
population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
print(population)
2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199
print(unemployment)
unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 42
80808 0.07 4310
population_array = np.array(population)
print(population_array) # Index info is lost
[[ 322]
[ 130]
[40038]
[45199]]
unemployment_array = np.array(unemployment)
print(population_array)
[[ 1.10000000e-01 3.44470000e+04]
[ 2.00000000e-02 4.80000000e+03]
[ 3.30000000e-01 4.20000000e+01]
[ 7.00000000e-02 4.31000000e+03]]
print(np.concatenate([population_array, unemployment_array], axis=1))
[[ 3.22000000e+02 1.10000000e-01 3.44470000e+04]
[ 1.30000000e+02 2.00000000e-02 4.80000000e+03]
[ 4.00380000e+04 3.30000000e-01 4.20000000e+01]
[ 4.51990000e+04 7.00000000e-02 4.31000000e+03]]
pd.concat([population, unemployment], axis=1, join='inner')
2010 Census Population unemployment participants
2860 45199 0.11 34447
pd.concat([population, unemployment], axis=1, join='outer')
2010 Census Population unemployment participants
1097 NaN 0.33 42.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
80808 NaN 0.07 4310.0
pd.concat([population, unemployment], join='inner', axis=0)
Empty DataFrame
Columns: []
Index: [2860, 46167, 1097, 80808, 57538, 59916, 37660, 2860]
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.
The DataFrames bronze
, silver
, and gold
have been pre-loaded for you.
Your task is to compute an inner join.
Instructions
medal_list
with entries bronze
, silver
, and gold
.medal_list
horizontally with an inner join to create medals
.keys=['bronze', 'silver', 'gold']
to yield suitable hierarchical indexing.axis=1
to get horizontal concatenation.join='inner'
to keep only rows that share common index labels.medals
.bronze = pd.read_csv(so_bronze5_file)
silver = pd.read_csv(so_silver_file)
gold = pd.read_csv(so_gold_file)
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]
# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')
# Print medals
medals
In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.
You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate offset alias for resampling, and the method .resample()
must be chained with some kind of aggregation method (.pct_change()
and .last()
in this case).
pandas
has been imported as pd
, and the DataFrames china
and us
have been pre-loaded, with the output of china.head()
and us.head()
printed in the IPython Shell.
Instructions
china_annual
by resampling the DataFrame china
with .resample('A').last()
(i.e., with annual frequency) and chaining two method calls:.pct_change(10)
as an aggregation method to compute the percentage change with an offset of ten years..dropna()
to eliminate rows containing null values.us_annual
by resampling the DataFrame us
exactly as you resampled china
.china_annual
and us_annual
to construct a DataFrame called gdp
. Use join='inner'
to perform an inner join and use axis=1
to concatenate horizontally.gdp
every decade (i.e., using .resample('10A')
) and aggregating with the method .last()
. This has been done for you, so hit 'Submit Answer' to see the result!china = pd.read_csv(gdp_china_file, parse_dates=['Year'])
china.rename(columns={'GDP': 'China'}, inplace=True)
china.set_index('Year', inplace=True)
us = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
us.rename(columns={'DATE': 'Year', 'VALUE': 'US'}, inplace=True)
us.set_index('Year', inplace=True)
china.head()
us.head()
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()
china_annual.head()
# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()
us_annual.head()
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)
# Resample gdp and print
gdp.resample('10A').last()
del bronze, silver, gold, medal_list, medals, china, us, china_annual, us_annual, gdp
Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.
merge()
extends concat()
with the ability to align rows using multiple columnspa_zipcode_population = {'Zipcode': [16855, 15681, 18657, 17307, 15635],
'2010 Census Population': [282, 5241, 11985, 5899, 220]}
pa_zipcode_city = {'Zipcode': [17545,18455, 17307, 15705, 16833, 16220, 18618, 16855, 16623, 15635, 15681, 18657, 15279, 17231, 18821],
'City': ['MANHEIM', 'PRESTON PARK', 'BIGLERVILLE', 'INDIANA', 'CURWENSVILLE', 'CROWN', 'HARVEYS LAKE', 'MINERAL SPRINGS',
'CASSVILLE', 'HANNASTOWN', 'SALTSBURG', 'TUNKHANNOCK', 'PITTSBURG', 'LEMASTERS', 'GREAT BEND'],
'State': ['PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA']}
population = pd.DataFrame.from_dict(pa_zipcode_population)
population
cities = pd.DataFrame.from_dict(pa_zipcode_city)
cities
pd.merge()
computes a merge on ALL columns that occur in both DataFramespd.merge(population, cities)
bronze = pd.read_csv(so_bronze_file)
bronze.head()
len(bronze)
gold = pd.read_csv(so_gold_file)
gold.head()
len(gold)
pd.merge()
uses all columns common to both DataFrames to mergeso_merge = pd.merge(bronze, gold)
so_merge.head()
len(so_merge)
so_merge.columns
so_merge.index
so_merge = pd.merge(bronze, gold, on='NOC')
so_merge.head()
len(so_merge)
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'])
so_merge.head()
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
so_merge.head()
pa_counties = {'CITY NAME': ['SALTSBURG', 'MINERAL SPRINGS', 'BIGLERVILLE', 'HANNASTOWN', 'TUNKHANNOCK'],
'COUNTY NAME': ['INDIANA', 'CLEARFIELD', 'ADAMS', 'WESTMORELAND', 'WYOMING']}
counties = pd.DataFrame.from_dict(pa_counties)
counties
cities.tail()
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
pd.merge(cities, counties, left_on='City', right_on='CITY NAME')
del pa_zipcode_population, pa_zipcode_city, population, cities, bronze, gold, so_merge, pa_counties, counties
Suppose your company has operations in several different cities under several different managers. The DataFrames revenue and managers contain partial information related to the company. That is, the rows of the city columns don't quite match in revenue and managers (the Mendocino branch has no revenue yet since it just opened and the manager of Springfield branch recently left the company).
The DataFrames have been printed in the IPython Shell. If you were to run the command combined = pd.merge(revenue, managers, on='city')
, how many rows would combined have?
rev = {'city': ['Austin', 'Denver', 'Springfield'], 'revenue': [100, 83, 4]}
man = {'city': ['Austin', 'Denver', 'Mendocino'], 'manager': ['Charles', 'Joel', 'Brett']}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
combined = pd.merge(revenue, managers, on='city')
combined
This exercise follows on the last one with the DataFrames revenue
and managers
for your company. You expect your company to grow and, eventually, to operate in cities with the same name on different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a branch_id
column to both DataFrames. Moreover, new cities have been added to both the revenue
and managers
DataFrames as well. pandas
has been imported as pd and both DataFrames are available in your namespace.
At present, there should be a 1-to-1 relationship between the city
and branch_id
fields. In that case, the result of a merge on the city
columns ought to give you the same output as a merge on the branch_id
columns. Do they? Can you spot an ambiguity in one of the DataFrames?
Instructions
pd.merge()
, merge the DataFrames revenue
and managers
on the 'city'
column of each. Store the result as merge_by_city
.merge_by_city
. This has been done for you.revenue
and managers
on the 'branch_id'
column of each. Store the result as merge_by_id
.merge_by_id
. This has been done for you, so hit 'Submit Answer' to see the result!rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'], 'revenue': [100, 83, 4, 200], 'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'], 'manager': ['Charles', 'Joel', 'Brett', 'Sally'], 'branch_id': [10, 20, 47, 31]}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')
# Print merge_by_city
merge_by_city
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')
# Print merge_by_id
merge_by_id
Notice that when you merge on 'city'
, the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The revenue
DataFrame has the one from Illinois, and the managers
DataFrame has the one from Missouri. Consequently, when you merge on 'branch_id'
, both of these get dropped from the merged DataFrame.
You continue working with the revenue
& managers
DataFrames from before. This time, someone has changed the field name 'city'
to 'branch'
in the managers
table. Now, when you attempt to merge DataFrames, an exception is thrown:
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
... <text deleted> ...
pd.merge(revenue, managers, on='city')
... <text deleted> ...
KeyError: 'city'
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the left_on
and right_on
parameters in the call to pd.merge()
.
As before, pandas
has been pre-imported as pd
and the revenue
and managers
DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.
Are you able to merge better than in the last exercise? How should the rows with Springfield
be handled?
Instructions
revenue
and managers
into a single DataFrame called combined
using the 'city'
and 'branch'
columns from the appropriate DataFrames.pd.merge()
, you will have to specify the parameters left_on
and right_on
appropriately.combined
.state_rev = {'Austin': 'TX', 'Denver': 'CO', 'Springfield': 'IL', 'Mendocino': 'CA'}
state_man = {'Austin': 'TX', 'Denver': 'CO', 'Mendocino': 'CA', 'Springfield': 'MO'}
revenue['state'] = revenue['city'].map(state_rev)
managers['state'] = managers['city'].map(state_man)
managers.rename(columns={'city': 'branch'}, inplace=True)
revenue
managers
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')
combined
Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called state
to both DataFrames from the preceding exercises. Again, pandas
has been pre-imported as pd
and the revenue
and managers
DataFrames are in your namespace.
Your goal in this exercise is to use pd.merge()
to merge DataFrames using multiple columns (using 'branch_id'
, 'city'
, and 'state'
in this case).
Are you able to match all your company's branches correctly?
Instructions
'state'
in the DataFrame revenue
, consisting of the list ['TX','CO','IL','CA']
.'state'
in the DataFrame managers
, consisting of the list ['TX','CO','CA','MO']
.revenue
and managers
using three columns :'branch_id'
, 'city'
, and 'state'
. Pass them in as a list to the on
paramater of pd.merge()
.managers.rename(columns={'branch': 'city'}, inplace=True)
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']
# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']
# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])
# Print combined
print(combined)
del rev, man, revenue, managers, merge_by_city, merge_by_id, combined
bronze = pd.read_csv(so_bronze_file)
bronze.head()
len(bronze)
gold = pd.read_csv(so_gold_file)
gold.head()
len(gold)
merge()
does an inner join by defaulthow=innner
is the default behavior so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
so_merge.head()
how=left
keeps all rows of the left DataFrame in the merged DataFramebronze = pd.read_csv(so_bronze5_file)
gold = pd.read_csv(so_gold5_file)
g_noc = ['USA', 'URS', 'GBR', 'ITA', 'GER']
b_noc = ['USA', 'URS', 'GBR', 'FRA', 'GER']
gold['NOC'] = g_noc
bronze['NOC'] = b_noc
gold
bronze
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='right')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')
population = pd.DataFrame.from_dict({'Zip Code ZCTA': [57538, 59916, 37660, 2860],
'2010 Census Population': [322, 130, 40038, 45199]})
population.set_index('Zip Code ZCTA', inplace=True)
population
unemployment = pd.DataFrame.from_dict({'Zip': [2860, 46167, 1097],
'unemployment': [0.11, 0.02, 0.33],
'participants': [ 34447, 4800, 32]})
unemployment.set_index('Zip', inplace=True)
unemployment
population.join(unemployment)
population.join(unemployment, how='right')
population.join(unemployment, how='inner')
population.join(unemployment, how='outer')
del bronze, gold, so_merge, g_noc, b_noc, population, unemployment
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
'state': ['TX','CO','IL','CA'],
'revenue': [100, 83, 4, 200],
'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
'state': ['TX','CO','CA','MO'],
'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
'branch_id': [10, 20, 47, 31]}
revenue = pd.DataFrame.from_dict(rev)
revenue.set_index('branch_id', inplace=True)
managers = pd.DataFrame.from_dict(man)
managers.set_index('branch_id', inplace=True)
revenue
managers
The DataFrames revenue
and managers
are displayed in the IPython Shell. Here, they are indexed by 'branch_id'
.
Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels [10, 20, 30, 31, 47]
. Explore each of them in the IPython Shell to get a better understanding of their functionality.
revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')
Suppose you have two DataFrames: students
(with columns 'StudentID'
, 'LastName'
, 'FirstName'
, and 'Major'
) and midterm_results
(with columns 'StudentID'
, 'Q1'
, 'Q2'
, and 'Q3'
for their scores on midterm questions).
You want to combine the DataFrames into a single DataFrame grades
, and be able to easily spot which students wrote the midterm and which didn't (their midterm question scores 'Q1'
, 'Q2'
, & 'Q3'
should be filled with NaN
values).
You also want to drop rows from midterm_results
in which the StudentID
is not found in students
.
Which of the following strategies gives the desired result?
students = pd.DataFrame.from_dict({'StudentID': [], 'LastName': [], 'FirstName': [], 'Major': []})
midterm_results = pd.DataFrame.from_dict({'StudentID': [], 'Q1': [], 'Q2': [], 'Q3': []})
students
midterm_results
grades = pd.merge(students, midterm_results, how='left')
You now have, in addition to the revenue
and managers
DataFrames from prior exercises, a DataFrame sales
that summarizes units sold from specific branches (identified by city
and state
but not branch_id
).
Once again, the managers
DataFrame uses the label branch
in place of city
as in the other two DataFrames. Your task here is to employ left and right merges to preserve data and identify where data is missing.
By merging revenue
and sales
with a right merge, you can identify the missing revenue
values. Here, you don't need to specify left_on
or right_on
because the columns to merge on have matching labels.
By merging sales
and managers
with a left merge, you can identify the missing manager
. Here, the columns to merge on have conflicting labels, so you must specify left_on
and right_on
. In both cases, you're looking to figure out how to connect the fields in rows containing Springfield
.
pandas
has been imported as pd
and the three DataFrames revenue
, managers
, and sales
have been pre-loaded. They have been printed for you to explore in the IPython Shell.
Instructions
pd.merge()
with revenue
and sales
to yield a new DataFrame revenue_and_sales
.how='right'
and on=['city', 'state']
.revenue_and_sales
. This has been done for you.sales
and managers
to yield a new DataFrame sales_and_managers
.how='left'
, left_on=['city', 'state']
, and right_on=['branch', 'state']
.sales_and_managers
. This has been done for you, so hit 'Submit Answer' to see the result!rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
'branch_id': [10, 20, 30, 47],
'state': ['TX','CO','IL','CA'],
'revenue': [100, 83, 4, 200]}
man = {'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
'branch_id': [10, 20, 47, 31],
'state': ['TX','CO','CA','MO'],
'manager': ['Charles', 'Joel', 'Brett', 'Sally']}
sale = {'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
'units': [1, 4, 2, 5, 1]}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
sales = pd.DataFrame.from_dict(sale)
revenue
managers
sales
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
# Print revenue_and_sales
revenue_and_sales
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])
# Print sales_and_managers
sales_and_managers
This exercise picks up where the previous one left off. The DataFrames revenue
, managers
, and sales
are pre-loaded into your namespace (and, of course, pandas
is imported as pd
). Moreover, the merged DataFrames revenue_and_sales
and sales_and_managers
have been pre-computed exactly as you did in the previous exercise.
The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with restricted subset of columns as keys.
Instructions
sales_and_managers
with revenue_and_sales
. Store the result as merge_default
.merge_default
. This has been done for you.sales_and_managers
with revenue_and_sales
using how='outer'
. Store the result as merge_outer
.merge_outer
. This has been done for you.sales_and_managers
with revenue_and_sales
only on ['city','state']
using an outer join. Store the result as merge_outer_on
and hit 'Submit Answer' to see what the merged DataFrames look like!# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)
# Print merge_default
merge_default
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')
# Print merge_outer
merge_outer
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city', 'state'], how='outer')
# Print merge_outer_on
merge_outer_on
del rev, man, revenue, managers, students, midterm_results, grades, sale, sales, revenue_and_sales, sales_and_managers, merge_default, merge_outer, merge_outer_on
software = pd.read_csv(sales_feb_software_file, parse_dates=['Date']).sort_values('Date')
software.head(10)
hardware = pd.read_csv(sales_feb_hardware_file, parse_dates=['Date']).sort_values('Date')
hardware.head()
sales_merge = pd.merge(hardware, software)
sales_merge
sales_merge.info()
sales_merge = pd.merge(hardware, software, how='outer')
sales_merge.head(14)
sales_merge = pd.merge(hardware, software, how='outer').sort_values('Date')
sales_merge.head(14)
sales_merged = pd.merge_ordered(hardware, software)
sales_merged.head(14)
sales_merged = pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
sales_merged.head()
stocks_dir = Path.cwd() / 'data' / 'stocks'
sp500_stocks = stocks_dir / 'SP500.csv'
aapl = stocks_dir/ 'AAPL.csv'
csco = stocks_dir/ 'CSCO.csv'
amzn = stocks_dir/ 'AMZN.csv'
msft = stocks_dir/ 'MSFT.csv'
ibm = stocks_dir/ 'IBM.csv'
sp500_df = pd.read_csv(sp500_stocks, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
aapl_df = pd.read_csv(aapl, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
csco_df = pd.read_csv(csco, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
amzn_df = pd.read_csv(amzn, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
msft_df = pd.read_csv(msft, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
ibm_df = pd.read_csv(ibm, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
sp500_df.rename(columns={'Close': 'S&P'}, inplace=True)
aapl_df.rename(columns={'Close': 'AAPL'}, inplace=True)
csco_df.rename(columns={'Close': 'CSCO'}, inplace=True)
amzn_df.rename(columns={'Close': 'AMZN'}, inplace=True)
msft_df.rename(columns={'Close': 'MSFT'}, inplace=True)
ibm_df.rename(columns={'Close': 'IBM'}, inplace=True)
stocks = pd.concat([sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df], axis=1)
stocks.head()
stocks.tail()
stocks.to_csv(stocks_dir / 'stocks.csv', index=True, index_label='Date')
gdp = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
gdp.sort_values(by=['DATE'], ascending=False, inplace=True)
gdp.reset_index(inplace=True, drop=True)
gdp.rename(columns={'VALUE': 'GDP', 'DATE': 'Date'}, inplace=True)
gdp.head(8)
gdp_2000_2015 = gdp[(gdp['Date'].dt.year >= 2000) & (gdp['Date'].dt.year <= 2015)]
stocks.reset_index(inplace=True)
stocks.head(5)
stocks_2000_2015 = stocks[(stocks['Date'].dt.year >= 2000) & (stocks['Date'].dt.year <= 2015)]
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date')
ordered_df.head()
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date', fill_method='ffill')
ordered_df.head()
del software, hardware, sales_merge, sales_merged, stocks_dir, sp500_stocks, aapl
del csco, amzn, msft, ibm, sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df, stocks
del gdp, gdp_2000_2015, stocks_2000_2015, ordered_df
This exercise uses pre-loaded DataFrames austin
and houston
that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.
Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use pd.merge_ordered()
. After you're done, note the order of the rows before and after merging.
Instructions
austin
and houston
using pd.merge_ordered()
. Store the result as tx_weather
.tx_weather
. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city.austin
and houston
.on='date'
and suffixes=['_aus','_hus']
so that the rows can be distinguished. Store the result as tx_weather_suff
.tx_weather_suff
to examine its contents. This has been done for you.austin
and houston
.on
and suffixes
parameters, specify the keyword argument fill_method='ffill'
to use forward-filling to replace NaN
entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames!austin = pd.DataFrame.from_dict({'date': ['2016-01-01', '2016-02-08', '2016-01-17'], 'ratings': ['Cloudy', 'Cloudy', 'Sunny']})
houston = pd.DataFrame.from_dict({'date': ['2016-01-04', '2016-01-01', '2016-03-01'], 'ratings': ['Rainy', 'Cloudy', 'Sunny']})
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)
# Print tx_weather
tx_weather
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'])
# Print tx_weather_suff
tx_weather_suff
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'], fill_method='ffill')
# Print tx_weather_ffill
tx_weather_ffill
del austin, houston, tx_weather, tx_weather_suff, tx_weather_ffill
Similar to pd.merge_ordered()
, the pd.merge_asof()
function will also merge values in order using the on
column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on'
column values are less than the left value will be kept.
This function can be used to align disparate datetime frequencies without having to first resample.
Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil
and auto
. The first 5 rows of each have been printed in the IPython Shell for you to explore.
These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.
You'll then inspect the merged DataFrame, resample by year and compute the mean 'Price'
and 'mpg'
. You should be able to see a trend in these two columns, that you can confirm by computing the Pearson correlation between resampled 'Price'
and 'mpg'
.
Instructions
auto
and oil
using pd.merge_asof()
with left_on='yr'
and ight_on='Date'
. Store the result as merged.merged
. This has been done for you.merged
using 'A'
(annual frequency), and on='Date'
. Select [['mpg','Price']]
and aggregate the mean. Store the result as yearly
.yearly
and yearly.corr()
, which shows the Pearson correlation between the resampled 'Price'
and 'mpg'
.oil = pd.read_csv(oil_price_file, parse_dates=['Date'])
auto = pd.read_csv(auto_fuel_file, parse_dates=['yr'])
oil.head()
auto.head()
# Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')
# Print the tail of merged
merged.tail()
# Resample merged: yearly
yearly = merged.resample('A', on='Date')[['mpg','Price']].mean()
# Print yearly
yearly
# print yearly.corr()
yearly.corr()
To cement your new skills, you'll apply them by working on an in-depth study involving Olympic medal data. The analysis involves integrating your multi-DataFrame skills from this course and also skills you've gained in previous pandas courses. This is a rich dataset that will allow you to fully leverage your pandas data manipulation skills. Enjoy!
pd.read_csv(so_ioc_codes_file).head(8)
pd.read_csv(so_editions_file, sep='\t').head(8)
pd.read_csv(so_all_medalists_file, sep='\t', header=4).head(8)
In this chapter, you'll be using The Guardian's Olympic medal dataset.
Your first task here is to prepare a DataFrame editions
from a tab-separated values (TSV) file.
Initially, editions
has 26 rows (one for each Olympic edition, i.e., a year in which the Olympics was held) and 7 columns: 'Edition'
, 'Bronze'
, 'Gold'
, 'Silver'
, 'Grand Total'
, 'City'
, and 'Country'
.
For the analysis that follows, you won't need the overall medal counts, so you want to keep only the useful columns from editions
: 'Edition'
, 'Grand Total'
, City
, and Country
.
Instructions
file_path
into a DataFrame called editions
. The identifier file_path
has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - EDITIONS.tsv'
. You'll have to use the option sep='\t'
because the file uses tabs to delimit fields (pd.read_csv()
expects commas by default).'Edition'
, 'Grand Total'
, 'City'
, and 'Country'
from editions
.editions
in entirety (there are only 26 rows). This has been done for you, so hit 'Submit Answer' to see the result!editions = pd.read_csv(so_editions_file, sep='\t')
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]
editions.head()
Your task here is to prepare a DataFrame ioc_codes
from a comma-separated values (CSV) file.
Initially, ioc_codes
has 200 rows (one for each country) and 3 columns: 'Country'
, 'NOC'
, & 'ISO code'
.
For the analysis that follows, you want to keep only the useful columns from ioc_codes: 'Country'
and 'NOC'
(the column 'NOC'
contains three-letter codes representing each country).
Instructions
file_path
into a DataFrame called ioc_codes
. The identifier file_path
has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - IOC COUNTRY CODES.csv'
.'Country'
and 'NOC'
from ioc_codes
.ioc_codes
(there are 200 rows in total). This has been done for you, so hit 'Submit Answer' to see the result!ioc_codes = pd.read_csv(so_ioc_codes_file)
ioc_codes = ioc_codes[['Country', 'NOC']]
ioc_codes.head()
Here, you'll start with the DataFrame editions from the previous exercise.
You have a sequence of files summer_1896.csv, summer_1900.csv, ..., summer_2008.csv, one for each Olympic edition (year).
You will build up a dictionary medals_dict with the Olympic editions (years) as keys and DataFrames as values.
The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of editions).
Once the dictionary of DataFrames is built up, you will combine the DataFrames using pd.concat().
Instructions
for year in editions['Edition']:
# Create the file path: file_path
file_path = 'summer_{:d}.csv'.format(year)
# Load file_path into a DataFrame: medals_dict[year]
medals_dict[year] = pd.read_csv(file_path)
# Extract relevant columns: medals_dict[year]
medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
# Assign year to column 'Edition' of medals_dict
medals_dict[year]['Edition'] = year
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)
medals = pd.read_csv(so_all_medalists_file, sep='\t', header=4)
medals = medals[['Athlete', 'NOC', 'Medal', 'Edition']]
medals.head()
Here, you'll start with the concatenated DataFrame medals
from the previous exercise.
You can construct a pivot table to see the number of medals each country won in each year. The result is a new DataFrame with the Olympic edition on the Index and with 138 country NOC
codes as columns. If you want a refresher on pivot tables, it may be useful to refer back to the relevant exercises in Manipulating DataFrames with pandas.
Instructions
medals
, aggregating by count
(by specifying the aggfunc
parameter). Use 'Edition'
as the index
, 'Athlete'
for the values
, and 'NOC'
for the columns
.medal_counts
. This has been done for you, so hit 'Submit Answer' to see the results!# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')
# Print the first & last 5 rows of medal_counts
medal_counts.head()
medal_counts.tail()
In this exercise, you'll start with the DataFrames editions
, medals
, & medal_counts
from prior exercises.
You can extract a Series with the total number of medals awarded in each Olympic edition.
The DataFrame medal_counts
can be divided row-wise by the total number of medals awarded each edition; the method .divide()
performs the broadcast as you require.
This gives you a normalized indication of each country's performance in each edition.
Instructions
editions
to be 'Edition'
(using the method .set_index()
). Save the result as totals
.'Grand Total'
column from totals
and assign the result back to totals
.medal_counts
by totals
along each row. You will have to use the .divide()
method with the option axis='rows'
. Assign the result to fractions
.fractions
. This has been done for you, so hit 'Submit Answer' to see the results!# Set Index of editions: totals
totals = editions.set_index('Edition')
totals.head()
# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']
totals.head()
# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')
# Print first & last 5 rows of fractions
fractions.head()
fractions.tail()
Here, you'll start with the DataFrames editions
, medals
, medal_counts
, & fractions
from prior exercises.
To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.
The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time. If you are interested in learning more about pandas' expanding transformations, this section of the pandas documentation has additional information.
Instructions
mean_fractions
by chaining the methods .expanding().mean()
to fractions
.mean_fractions
down each column by applying .pct_change()
and multiplying by 100
. Assign the result to fractions_change
.fractions_change
using the .reset_index()
method. This will make 'Edition'
an ordinary column.fractions_change
. This has been done for you, so hit 'Submit Answer' to see the results!# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()
mean_fractions.head()
# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change()*100
fractions_change.head()
# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()
# Print first & last 5 rows of fractions_change
fractions_change.head()
fractions_change.tail()