Introduction and Setup
Synopsis
The code cells that follow have been updated to pandas v2.2.2, so some code does not match the associated instructions, but these updates are minor.
Here is a concise synopsis of each major section within the document:
Preparing Data
- Discusses various techniques for importing multiple files into DataFrames.
- Focuses on how to use Indexes to share information between DataFrames.
- Covers essential commands for reading data files, such as pd.read_csv(), and highlights their importance for subsequent merging tasks.
Concatenating Data
- Explores database-style operations to append and concatenate DataFrames using real-world datasets.
- Describes methods like .append() and pd.concat(), which stack rows or join DataFrames along an axis.
- Some instructions reference .append(), which has been removed from pandas in favor of pd.concat().
- Emphasizes the handling of indices during the concatenation process and introduces the concept of hierarchical indexing.
Merging Data
- Provides an in-depth look at merging techniques in pandas, including different types of joins (left, right, inner, outer).
- Explains the use of the merge() function to align rows using one or more columns.
- Discusses ordered merging, which is particularly useful when dealing with columns that have a natural ordering, like dates.
Case Study - Summer Olympics
- Applies previously discussed DataFrame skills to analyze Olympic medal data, integrating lessons from both the current and prior pandas courses.
- Uses the dataset of Summer Olympic medalists from 1896 to 2008 to showcase data manipulation capabilities in pandas.
Each section builds on the knowledge from the previous, culminating in a practical case study that utilizes all the discussed DataFrame manipulation techniques.
Imports
1
2
3
4
5
6
7
| import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path
import yfinance as yf
|
1
2
3
4
| print(f'Pandas Version: {pd.__version__}')
print(f'Matplotlib Version: {plt.matplotlib.__version__}')
print(f'Numpy Version: {np.__version__}')
print(f'Yahoo Finance Version: {yf.__version__}')
|
1
2
3
4
| Pandas Version: 2.2.2
Matplotlib Version: 3.8.4
Numpy Version: 1.26.4
Yahoo Finance Version: 0.2.38
|
Pandas Configuration Options
1
2
3
| pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
|
Data Files Location
Data File Objects
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| data = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'
auto_fuel_file = data / 'auto_fuel_efficiency.csv'
baby_1881_file = data / 'baby_names1881.csv'
baby_1981_file = data / 'baby_names1981.csv'
exch_rates_file = data / 'exchange_rates.csv'
gdp_china_file = data / 'gdp_china.csv'
gdp_usa_file = data / 'gdp_usa.csv'
oil_price_file = data / 'oil_price.csv'
pitts_file = data / 'pittsburgh_weather_2013.csv'
sales_feb_hardware_file = data / 'sales-feb-Hardware.csv'
sales_feb_service_file = data / 'sales-feb-Service.csv'
sales_feb_software_file = data / 'sales-feb-Software.csv'
sales_jan_2015_file = data / 'sales-jan-2015.csv'
sales_feb_2015_file = data / 'sales-feb-2015.csv'
sales_mar_2015_file = data / 'sales-mar-2015.csv'
sp500_file = data / 'sp500.csv'
so_bronze_file = data / 'summer_olympics_Bronze.csv'
so_bronze5_file = data / 'summer_olympics_bronze_top5.csv'
so_gold_file = data / 'summer_olympics_Gold.csv'
so_gold5_file = data / 'summer_olympics_gold_top5.csv'
so_silver_file = data / 'summer_olympics_Silver.csv'
so_silver5_file = data / 'summer_olympics_silver_top5.csv'
so_all_medalists_file = data / 'summer_olympics_medalists 1896 to 2008 - ALL MEDALISTS.tsv'
so_editions_file = data / 'summer_olympics_medalists 1896 to 2008 - EDITIONS.tsv'
so_ioc_codes_file = data / 'summer_olympics_medalists 1896 to 2008 - IOC COUNTRY CODES.csv'
|
Course Description
As a Data Scientist, you’ll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist’s toolbox. You’ll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.
Preparing Data
In this chapter, you’ll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you’ll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you’ll need for merging DataFrames later in the course.
Reading multiple data files
- pd.read_csv() for CSV files: df = pd.read_csv(filepath)
- dozens of optional input parameters
- Other data import tools: pd.read_excel(), pd.read_html(), pd.read_json()
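A quick, hedged sketch of the other import tools listed above; the file names and URL here are hypothetical placeholders, not files from this course:
| import pandas as pd

# Hypothetical sources -- substitute your own paths/URLs
excel_df = pd.read_excel('sales.xlsx', sheet_name=0)      # first sheet of an Excel workbook (needs openpyxl)
json_df = pd.read_json('sales.json')                       # JSON records
html_tables = pd.read_html('https://example.com/tables')   # list of DataFrames, one per HTML <table> (needs lxml)
|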
Loading separate files
1
2
3
| import pandas as pd
dataframe0 = pd.read_csv('sales-jan-2015.csv')
dataframe1 = pd.read_csv('sales-feb-2015.csv')
|
Using a loop
1
2
3
4
| filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
dataframes.append(pd.read_csv(f))
|
Using a comprehension
1
2
| filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]
|
Using glob
1
2
3
| from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
|
Exercises
Reading DataFrames from multiple files
When data is spread among several files, you usually invoke pandas’ read_csv() (or a similar data import function) multiple times to load the data into several DataFrames.
The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the Guardian.
The column labels of each DataFrame are NOC, Country, & Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).
Instructions
- Import pandas as pd.
- Read the file 'Bronze.csv' into a DataFrame called bronze.
- Read the file 'Silver.csv' into a DataFrame called silver.
- Read the file 'Gold.csv' into a DataFrame called gold.
- Print the first 5 rows of the DataFrame gold. This has been done for you, so hit 'Submit Answer' to see the results.
1
2
3
4
5
6
7
8
9
10
11
| # Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(so_bronze_file)
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(so_silver_file)
# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(so_gold_file)
# Print the first five rows of gold
gold.head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 2088.0 |
---|
1 | URS | Soviet Union | 838.0 |
---|
2 | GBR | United Kingdom | 498.0 |
---|
3 | FRA | France | 378.0 |
---|
4 | GER | Germany | 407.0 |
---|
Reading DataFrames from multiple files in a loop
As you saw in the video, loading data from multiple files into DataFrames is done more efficiently with a loop or a list comprehension.
Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.
Here, you’ll continue working with The Guardian’s Olympic medal dataset.
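For instance, a minimal sketch of the same comprehension pattern with a non-CSV reader (the Excel file names are hypothetical, not part of this exercise):
| # Any pandas reader works in a comprehension, not just pd.read_csv()
filenames = ['sales-q1.xlsx', 'sales-q2.xlsx', 'sales-q3.xlsx']  # hypothetical workbooks
dataframes = [pd.read_excel(f) for f in filenames]
|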
Instructions
- Create a list of file names called filenames with three strings 'Gold.csv', 'Silver.csv', & 'Bronze.csv'. This has been done for you.
- Use a for loop to create another list called dataframes containing the three DataFrames loaded from filenames:
- Iterate over filenames.
- Read each CSV file in filenames into a DataFrame and append it to dataframes by using pd.read_csv() inside a call to .append().
- Print the first 5 rows of the first DataFrame of the list dataframes. This has been done for you, so hit ‘Submit Answer’ to see the results.
1
2
3
4
5
6
7
8
9
10
| # Create the list of file names: filenames
filenames = [so_bronze_file, so_silver_file, so_gold_file]
# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
dataframes.append(pd.read_csv(filename))
# Print top 5 rows of 1st DataFrame in dataframes
dataframes[0].head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 1052.0 |
---|
1 | URS | Soviet Union | 584.0 |
---|
2 | GBR | United Kingdom | 505.0 |
---|
3 | FRA | France | 475.0 |
---|
4 | GER | Germany | 454.0 |
---|
Combining DataFrames from multiple data files
In this exercise, you’ll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you’ll use here is clumsy. Later on in the course, you’ll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.
Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.
Instructions
- Construct a copy of the DataFrame gold called medals using the .copy() method.
- Create a list called new_labels with entries 'NOC', 'Country', & 'Gold'. This is the same as the column labels from gold with the column label 'Total' replaced by 'Gold'.
- Rename the columns of medals by assigning new_labels to medals.columns.
- Create new columns 'Silver' and 'Bronze' in medals using silver['Total'] & bronze['Total'].
- Print the top 5 rows of the final DataFrame medals. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| # Make a copy of gold: medals
medals = gold.copy()
# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']
# Rename the columns of medals using new_labels
medals.columns = new_labels
# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']
# Print the head of medals
medals.head()
|
| NOC | Country | Gold | Silver | Bronze |
---|
0 | USA | United States | 2088.0 | 1195.0 | 1052.0 |
---|
1 | URS | Soviet Union | 838.0 | 627.0 | 584.0 |
---|
2 | GBR | United Kingdom | 498.0 | 591.0 | 505.0 |
---|
3 | FRA | France | 378.0 | 461.0 | 475.0 |
---|
4 | GER | Germany | 407.0 | 350.0 | 454.0 |
---|
1
| del bronze, silver, gold, dataframes, medals
|
Reindexing DataFrames
“Indexes” vs. “Indices”
- indices: many index labels within Index data structures
- indexes: many pandas Index data structures
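A minimal sketch of the distinction, using made-up data: a DataFrame's .index attribute is one pandas Index object (an "index"), and the labels stored inside it are the "indices".
| import pandas as pd

df = pd.DataFrame({'Total': [1052.0, 584.0]}, index=['United States', 'Soviet Union'])
print(type(df.index))   # one pandas Index data structure
print(list(df.index))   # ['United States', 'Soviet Union'] -- the individual index labels
|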
Importing weather data
1
2
3
| import pandas as pd
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
w_max = pd.read_csv('quarterly_max_temp.csv', index_col='Month')
|
Examining the data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| print(w_mean)
Mean TemperatureF
Month
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783
print(w_max)
Max TemperatureF
Month
Jan 68
Apr 89
Jul 91
Oct 84
|
The DataFrame indexes
1
2
3
4
5
6
7
8
| print(w_mean.index)
Index(['Apr', 'Jan', 'Jul', 'Oct'], dtype='object', name='Month')
print(w_max.index)
Index(['Jan', 'Apr', 'Jul', 'Oct'], dtype='object', name='Month')
print(type(w_mean.index))
<class 'pandas.indexes.base.Index'>
|
Using .reindex()
1
2
3
4
5
6
7
8
9
10
| ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
|
Using .sort_index()
1
2
3
4
5
6
7
| w_mean2.sort_index()
Mean TemperatureF
Month
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783
|
Reindex from a DataFrame Index
1
2
3
4
5
6
7
| w_mean.reindex(w_max.index)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
|
Reindexing with missing labels
1
2
3
4
5
6
7
| w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Dec'])
print(w_mean3)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Dec NaN
|
Reindex from a DataFrame Index
1
2
3
4
5
6
7
8
9
10
11
12
| w_max.reindex(w_mean3.index)
Max TemperatureF
Month
Jan 68.0
Apr 89.0
Dec NaN
w_max.reindex(w_mean3.index).dropna()
Max TemperatureF
Month
Jan 68.0
Apr 89.0
|
Order matters
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| w_max.reindex(w_mean.index)
Max TemperatureF
Month
Apr 89
Jan 68
Jul 91
Oct 84
w_mean.reindex(w_max.index)
Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
|
Exercises
Sorting DataFrame with the Index & columns
It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don’t have to implement these yourself; the principal methods for doing this are .sort_index() and .sort_values().
In this exercise, you’ll use these methods with a DataFrame of temperature values indexed by month names. You’ll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.
Instructions
- Read 'monthly_max_temp.csv' into a DataFrame called weather1 with 'Month' as the index.
- Sort the index of weather1 in alphabetical order using the .sort_index() method and store the result in weather2.
- Sort the index of weather1 in reverse alphabetical order by specifying the additional keyword argument ascending=False inside .sort_index().
- Use the .sort_values() method to sort weather1 in increasing numerical order according to the values of the column 'Max TemperatureF'.
1
2
| monthly_max_temp = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
'Max TemperatureF': [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]}
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| # Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)
# Print the head of weather1
print(weather1.head())
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()
# Print the head of weather2
print(weather2.head())
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)
# Print the head of weather3
print(weather3.head())
# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values(by='Max TemperatureF')
# Print the head of weather4
print(weather4.head())
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
| Max TemperatureF
Month
Jan 68
Feb 60
Mar 68
Apr 84
May 88
Max TemperatureF
Month
Apr 84
Aug 86
Dec 68
Feb 60
Jan 68
Max TemperatureF
Month
Sep 90
Oct 84
Nov 72
May 88
Mar 68
Max TemperatureF
Month
Feb 60
Jan 68
Mar 68
Dec 68
Nov 72
|
Reindexing DataFrame from a list
Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex() method.
In this exercise, you’ll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).
The original data has the first month’s abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called weather1 and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.
You’ll initially use a list of all twelve month abbreviations and subsequently apply the .ffill() method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year.
Instructions
- Reorder the rows of weather1 using the .reindex() method with the list year as the argument, which contains the abbreviations for each month.
- Reorder the rows of weather1 just as you did above, this time chaining the .ffill() method to replace the null values with the last preceding non-null value.
1
2
3
4
5
| monthly_max_temp = {'Month': ['Jan', 'Apr', 'Jul', 'Oct'],
'Max TemperatureF': [32.13333, 61.956044, 68.934783, 43.434783]}
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)
weather1
|
| Max TemperatureF |
---|
Month | |
---|
Jan | 32.133330 |
---|
Apr | 61.956044 |
---|
Jul | 68.934783 |
---|
Oct | 43.434783 |
---|
1
2
3
4
5
6
7
| year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)
# Print weather2
weather2
|
| Max TemperatureF |
---|
Month | |
---|
Jan | 32.133330 |
---|
Feb | NaN |
---|
Mar | NaN |
---|
Apr | 61.956044 |
---|
May | NaN |
---|
Jun | NaN |
---|
Jul | 68.934783 |
---|
Aug | NaN |
---|
Sep | NaN |
---|
Oct | 43.434783 |
---|
Nov | NaN |
---|
Dec | NaN |
---|
1
2
3
4
5
| # Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()
# Print weather3
weather3
|
| Max TemperatureF |
---|
Month | |
---|
Jan | 32.133330 |
---|
Feb | 32.133330 |
---|
Mar | 32.133330 |
---|
Apr | 61.956044 |
---|
May | 61.956044 |
---|
Jun | 61.956044 |
---|
Jul | 68.934783 |
---|
Aug | 68.934783 |
---|
Sep | 68.934783 |
---|
Oct | 43.434783 |
---|
Nov | 43.434783 |
---|
Dec | 43.434783 |
---|
Reindexing DataFrame using another DataFrame Index
Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.
The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.
The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you’re interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:
1
2
| names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
|
As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.
Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.
Instructions
- Create a new DataFrame common_names by reindexing names_1981 using the Index of the DataFrame names_1881 of older names.
- Print the shape of the new common_names DataFrame. This has been done for you. It should be the same as that of names_1881.
- Drop the rows of common_names that have null counts using the .dropna() method. These rows correspond to names that fell out of fashion between 1881 & 1981.
- Print the shape of the reassigned common_names DataFrame. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
| names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'], index_col=(0,1))
names_1981.head()
|
| | count |
---|
name | gender | |
---|
Jennifer | F | 57032 |
---|
Jessica | F | 42519 |
---|
Amanda | F | 34370 |
---|
Sarah | F | 28162 |
---|
Melissa | F | 28003 |
---|
1
2
| names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name','gender','count'], index_col=(0,1))
names_1881.head()
|
| | count |
---|
name | gender | |
---|
Mary | F | 6919 |
---|
Anna | F | 2698 |
---|
Emma | F | 2034 |
---|
Elizabeth | F | 1852 |
---|
Margaret | F | 1658 |
---|
1
2
3
4
5
| # Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)
# Print shape of common_names
common_names.shape
|
1
2
3
4
5
| # Drop rows with null counts: common_names
common_names = common_names.dropna()
# Print shape of new common_names
common_names.shape
|
| | count |
---|
name | gender | |
---|
Mary | F | 11030.0 |
---|
Anna | F | 5182.0 |
---|
Emma | F | 532.0 |
---|
Elizabeth | F | 20168.0 |
---|
Margaret | F | 2791.0 |
---|
Minnie | F | 56.0 |
---|
Ida | F | 206.0 |
---|
Annie | F | 973.0 |
---|
Bertha | F | 209.0 |
---|
Alice | F | 745.0 |
---|
1
| del weather1, weather2, weather3, weather4, common_names, names_1881, names_1981
|
Arithmetic with Series & DataFrames
Loading weather data
1
2
3
4
5
6
7
8
9
10
11
12
13
| import pandas as pd
weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'PrecipitationIn']
Date
2013-07-01 0.18
2013-07-02 0.14
2013-07-03 0.00
2013-07-04 0.25
2013-07-05 0.02
2013-07-06 0.06
2013-07-07 0.10
Name: PrecipitationIn, dtype: float64
|
Scalar multiplication
1
2
3
4
5
6
7
8
9
10
11
| weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54
Date
2013-07-01 0.4572
2013-07-02 0.3556
2013-07-03 0.0000
2013-07-04 0.6350
2013-07-05 0.0508
2013-07-06 0.1524
2013-07-07 0.2540
Name: PrecipitationIn, dtype: float64
|
Absolute temperature range
1
2
3
4
5
6
7
8
9
10
11
| week1_range = weather.loc['2013-07-01':'2013-07-07', ['Min TemperatureF', 'Max TemperatureF']]
print(week1_range)
Min TemperatureF Max TemperatureF
Date
2013-07-01 66 79
2013-07-02 66 84
2013-07-03 71 86
2013-07-04 70 86
2013-07-05 69 86
2013-07-06 70 89
2013-07-07 70 77
|
Average temperature
1
2
3
4
5
6
7
8
9
10
11
| week1_mean = weather.loc['2013-07-01':'2013-07-07', 'Mean TemperatureF']
print(week1_mean)
Date
2013-07-01 72
2013-07-02 74
2013-07-03 78
2013-07-04 77
2013-07-05 76
2013-07-06 78
2013-07-07 72
Name: Mean TemperatureF, dtype: int64
|
Relative temperature range
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| week1_range / week1_mean
RuntimeWarning: Cannot compare type 'Timestamp' with type 'str', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
2013-07-01 00:00:00 2013-07-02 00:00:00 2013-07-03 00:00:00 \
Date
2013-07-01 NaN NaN NaN
2013-07-02 NaN NaN NaN
2013-07-03 NaN NaN NaN
2013-07-04 NaN NaN NaN
2013-07-05 NaN NaN NaN
2013-07-06 NaN NaN NaN
2013-07-07 NaN NaN NaN
2013-07-04 00:00:00 2013-07-05 00:00:00 2013-07-06 00:00:00 \
Date
2013-07-01 NaN NaN NaN
... ...
|
Relative temperature range
1
2
3
4
5
6
7
8
9
10
11
| week1_range.divide(week1_mean, axis='rows')
Min TemperatureF Max TemperatureF
Date
2013-07-01 0.916667 1.097222
2013-07-02 0.891892 1.135135
2013-07-03 0.910256 1.102564
2013-07-04 0.909091 1.116883
2013-07-05 0.907895 1.131579
2013-07-06 0.897436 1.141026
2013-07-07 0.972222 1.069444
|
Percentage changes
1
2
3
4
5
6
7
8
9
10
11
| week1_mean.pct_change() * 100
Date
2013-07-01 NaN
2013-07-02 2.777778
2013-07-03 5.405405
2013-07-04 -1.282051
2013-07-05 -1.298701
2013-07-06 2.631579
2013-07-07 -7.692308
Name: Mean TemperatureF, dtype: float64
|
Bronze Olympic medals
1
2
3
4
5
6
7
8
9
| bronze = pd.read_csv('bronze_top5.csv', index_col=0)
print(bronze)
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
|
Silver Olympic medals
1
2
3
4
5
6
7
8
9
| silver = pd.read_csv('silver_top5.csv', index_col=0)
print(silver)
Total
Country
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
|
Gold Olympic medals
1
2
3
4
5
6
7
8
9
| gold = pd.read_csv('gold_top5.csv', index_col=0)
print(gold)
Total
Country
United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0
|
Adding bronze, silver
1
2
3
4
5
6
7
8
9
10
| bronze + silver
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
|
Adding bronze, silver
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| bronze + silver
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
In [22]: print(bronze['United States'])
1052.0
In [23]: print(silver['United States'])
1195.0
|
Using the .add() method
1
2
3
4
5
6
7
8
9
10
| bronze.add(silver)
Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
|
Using a fill_value
1
2
3
4
5
6
7
8
9
10
| bronze.add(silver, fill_value=0)
Country
France 936.0
Germany 454.0
Italy 394.0
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
|
Adding bronze, silver, gold
1
2
3
4
5
6
7
8
9
10
| bronze + silver + gold
Country
France NaN
Germany NaN
Italy NaN
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
|
Chaining .add()
1
2
3
4
5
6
7
8
9
10
| bronze.add(silver, fill_value=0).add(gold, fill_value=0)
Country
France 936.0
Germany 861.0
Italy 854.0
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64
|
Exercises
Adding unaligned DataFrames
The DataFrames january and february, which have been printed in the IPython Shell, represent the sales a company made in the corresponding months.
The Indexes in both DataFrames are called Company, identifying which company bought that quantity of units. The column Units is the number of units sold.
If you were to add these two DataFrames by executing the command total = january + february, how many rows would the resulting DataFrame have? Try this in the IPython Shell and find out for yourself.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| jan_dict = {'Company': ['Acme Corporation', 'Hooli', 'Initech', 'Mediacore', 'Streeplex'],
'Units': [19, 17, 20, 10, 13]}
feb_dict = {'Company': ['Acme Corporation', 'Hooli', 'Mediacore', 'Vandelay Inc'],
'Units': [15, 3, 12, 25]}
january = pd.DataFrame.from_dict(jan_dict)
january.set_index('Company', inplace=True)
print(january)
february = pd.DataFrame.from_dict(feb_dict)
february.set_index('Company', inplace=True)
print('\n', february, '\n')
print(january + february)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
| Units
Company
Acme Corporation 19
Hooli 17
Initech 20
Mediacore 10
Streeplex 13
Units
Company
Acme Corporation 15
Hooli 3
Mediacore 12
Vandelay Inc 25
Units
Company
Acme Corporation 34.0
Hooli 20.0
Initech NaN
Mediacore 22.0
Streeplex NaN
Vandelay Inc NaN
|
Broadcasting in arithmetic formulas
In this exercise, you’ll work with weather data pulled from wunderground.com. The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.
You’ll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.
Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.
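A minimal sketch of that broadcasting behaviour with made-up temperatures (illustrative values only):
| import pandas as pd

# Two days of made-up Fahrenheit readings
temps_f = pd.DataFrame({'Min TemperatureF': [21, 17], 'Max TemperatureF': [32, 25]},
                       index=['2013-1-1', '2013-1-2'])

# The scalars 32, 5 and 9 are broadcast to every element of the DataFrame
temps_c = (temps_f - 32) * 5 / 9
print(temps_c)
|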
Instructions
- Create a new DataFrame temps_f by extracting the columns 'Min TemperatureF', 'Mean TemperatureF', & 'Max TemperatureF' from weather as a new DataFrame temps_f. To do this, pass the relevant columns as a list to weather[].
- Create a new DataFrame temps_c from temps_f using the formula (temps_f - 32) * 5/9.
- Rename the columns of temps_c to replace 'F' with 'C' using the .str.replace('F', 'C') method on temps_c.columns.
- Print the first 5 rows of DataFrame temps_c. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
| weather = pd.read_csv(pitts_file)
weather.set_index('Date', inplace=True)
weather.head(3)
|
| Max TemperatureF | Mean TemperatureF | Min TemperatureF | Max Dew PointF | MeanDew PointF | Min DewpointF | Max Humidity | Mean Humidity | Min Humidity | Max Sea Level PressureIn | Mean Sea Level PressureIn | Min Sea Level PressureIn | Max VisibilityMiles | Mean VisibilityMiles | Min VisibilityMiles | Max Wind SpeedMPH | Mean Wind SpeedMPH | Max Gust SpeedMPH | PrecipitationIn | CloudCover | Events | WindDirDegrees |
---|
Date | | | | | | | | | | | | | | | | | | | | | | |
---|
2013-1-1 | 32 | 28 | 21 | 30 | 27 | 16 | 100 | 89 | 77 | 30.10 | 30.01 | 29.94 | 10 | 6 | 2 | 10 | 8 | NaN | 0.0 | 8 | Snow | 277 |
---|
2013-1-2 | 25 | 21 | 17 | 14 | 12 | 10 | 77 | 67 | 55 | 30.27 | 30.18 | 30.08 | 10 | 10 | 10 | 14 | 5 | NaN | 0.0 | 4 | NaN | 272 |
---|
2013-1-3 | 32 | 24 | 16 | 19 | 15 | 9 | 77 | 67 | 56 | 30.25 | 30.21 | 30.16 | 10 | 10 | 10 | 17 | 8 | 26.0 | 0.0 | 3 | NaN | 229 |
---|
1
2
3
4
5
6
7
8
9
10
11
| # Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9
# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')
# Print first 5 rows of temps_c
temps_c.head()
|
| Min TemperatureC | Mean TemperatureC | Max TemperatureC |
---|
Date | | | |
---|
2013-1-1 | -6.111111 | -2.222222 | 0.000000 |
---|
2013-1-2 | -8.333333 | -6.111111 | -3.888889 |
---|
2013-1-3 | -8.888889 | -4.444444 | 0.000000 |
---|
2013-1-4 | -2.777778 | -2.222222 | -1.111111 |
---|
2013-1-5 | -3.888889 | -1.111111 | 1.111111 |
---|
Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.
The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv
, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from pandas Foundations.
Instructions
- Read the file 'GDP.csv' into a DataFrame called gdp.
- Use parse_dates=True and index_col='DATE'.
- Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
- Print the last 8 rows of the slice post2008. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.
- Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you need to chain .resample() (using the alias 'A' for annual frequency; recent pandas versions deprecate 'A', so the code below uses 'YE' instead) with some kind of aggregation; you will use the aggregation method .last() to select the last element when resampling.
- Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.
1
2
3
| # Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(gdp_usa_file, parse_dates=True, index_col='DATE')
gdp.head()
|
| VALUE |
---|
DATE | |
---|
1947-01-01 | 243.1 |
---|
1947-04-01 | 246.3 |
---|
1947-07-01 | 250.1 |
---|
1947-10-01 | 260.3 |
---|
1948-01-01 | 266.2 |
---|
1
2
3
4
5
| # Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':]
# Print the last 8 rows of post2008
post2008.tail(8)
|
| VALUE |
---|
DATE | |
---|
2014-07-01 | 17569.4 |
---|
2014-10-01 | 17692.2 |
---|
2015-01-01 | 17783.6 |
---|
2015-04-01 | 17998.3 |
---|
2015-07-01 | 18141.9 |
---|
2015-10-01 | 18222.8 |
---|
2016-01-01 | 18281.6 |
---|
2016-04-01 | 18436.5 |
---|
1
2
3
4
5
| # Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('YE').last()
# Print yearly
yearly
|
| VALUE |
---|
DATE | |
---|
2008-12-31 | 14549.9 |
---|
2009-12-31 | 14566.5 |
---|
2010-12-31 | 15230.2 |
---|
2011-12-31 | 15785.3 |
---|
2012-12-31 | 16297.3 |
---|
2013-12-31 | 16999.9 |
---|
2014-12-31 | 17692.2 |
---|
2015-12-31 | 18222.8 |
---|
2016-12-31 | 18436.5 |
---|
1
2
3
4
5
| # Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100
# Print yearly again
yearly
|
| VALUE | growth |
---|
DATE | | |
---|
2008-12-31 | 14549.9 | NaN |
---|
2009-12-31 | 14566.5 | 0.114090 |
---|
2010-12-31 | 15230.2 | 4.556345 |
---|
2011-12-31 | 15785.3 | 3.644732 |
---|
2012-12-31 | 16297.3 | 3.243524 |
---|
2013-12-31 | 16999.9 | 4.311144 |
---|
2014-12-31 | 17692.2 | 4.072377 |
---|
2015-12-31 | 18222.8 | 2.999062 |
---|
2016-12-31 | 18436.5 | 1.172707 |
---|
Converting currency of stocks
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv for the S&P 500 prices and exchange.csv for the exchange rates are both provided to you.
Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.
Instructions
- Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively.
- Use parse_dates=True and index_col='Date'.
- Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and print the first 5 rows.
- Construct a new DataFrame pounds by converting US dollars to British pounds. You’ll use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'.
- Print the first 5 rows of the new DataFrame pounds. This has been done for you, so hit ‘Submit Answer’ to see the results!
1
2
3
| # Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv(sp500_file, parse_dates=True, index_col='Date')
sp500.head()
|
| Open | High | Low | Close | Adj Close | Volume | tkr |
---|
Date | | | | | | | |
---|
1980-01-02 | 0.0 | 108.430000 | 105.290001 | 105.760002 | 105.760002 | 40610000 | ^gspc |
---|
1980-01-03 | 0.0 | 106.080002 | 103.260002 | 105.220001 | 105.220001 | 50480000 | ^gspc |
---|
1980-01-04 | 0.0 | 107.080002 | 105.089996 | 106.519997 | 106.519997 | 39130000 | ^gspc |
---|
1980-01-07 | 0.0 | 107.800003 | 105.800003 | 106.809998 | 106.809998 | 44500000 | ^gspc |
---|
1980-01-08 | 0.0 | 109.290001 | 106.290001 | 108.949997 | 108.949997 | 53390000 | ^gspc |
---|
1
2
3
| # Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv(exch_rates_file, parse_dates=True, index_col='Date')
exchange.head()
|
| GBP/USD |
---|
Date | |
---|
2015-01-02 | 0.65101 |
---|
2015-01-05 | 0.65644 |
---|
2015-01-06 | 0.65896 |
---|
2015-01-07 | 0.66344 |
---|
2015-01-08 | 0.66151 |
---|
1
2
3
4
5
| # Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]
# Print the head of dollars
dollars.head()
|
| Open | Close |
---|
Date | | |
---|
1980-01-02 | 0.0 | 105.760002 |
---|
1980-01-03 | 0.0 | 105.220001 |
---|
1980-01-04 | 0.0 | 106.519997 |
---|
1980-01-07 | 0.0 | 106.809998 |
---|
1980-01-08 | 0.0 | 108.949997 |
---|
1
2
3
4
5
| # Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')
# Print the head of pounds
pounds.head()
|
| Open | Close |
---|
Date | | |
---|
1980-01-02 | NaN | NaN |
---|
1980-01-03 | NaN | NaN |
---|
1980-01-04 | NaN | NaN |
---|
1980-01-07 | NaN | NaN |
---|
1980-01-08 | NaN | NaN |
---|
1
| del january, february, feb_dict, jan_dict, weather, temps_f, temps_c, gdp, post2008, yearly, sp500, exchange, dollars, pounds
|
Concatenating Data
Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you’ll learn how to perform database-style operations to combine DataFrames. In particular, you’ll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.
Appending & concatenating Series
append()
- .append(): Series & DataFrame method
- Invocation:
- s1.append(s2)
- Stacks rows of s2 below s1
- Method for Series & DataFrames
concat()
- concat(): pandas module function
- Invocation:
- pd.concat([s1, s2, s3])
- Can stack row-wise or column-wise
concat() & .append()
- Equivalence of concat() & .append():
- result1 = pd.concat([s1, s2, s3])
- result2 = s1.append(s2).append(s3)
- result1 == result2 elementwise
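Since .append() no longer exists on Series and DataFrames in pandas 2.x, a minimal sketch of how the chained form translates to a single pd.concat() call:
| import pandas as pd

s1 = pd.Series(['CT', 'ME'])
s2 = pd.Series(['DE', 'FL'])
s3 = pd.Series(['IL', 'IN'])

# Old (pre-2.0): result = s1.append(s2).append(s3)
# Current equivalent: one concat over the whole list
result = pd.concat([s1, s2, s3])
print(result)
|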
Series of US states
1
2
3
4
5
| import pandas as pd
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND', 'SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT',
|
Using .append()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| east = northeast.append(south)
print(east)
0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
0     DE
1     FL
2     GA
3     MD
4     NC
5     SC
6     VA
7     DC
8     WV
9     AL
10    KY
11    MS
12    TN
13    AR
14    LA
15    OK
16    TX
dtype: object
|
The appended Index
1
2
3
4
5
6
7
| print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')
print(east.loc[3])
3 NH
3 MD
dtype: object
|
Using .reset_index()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
|
Using concat()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| east = pd.concat([northeast, south])
print(east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
0 DE
1 FL
dtype: object
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')
|
Using ignore_index
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| new_east = pd.concat([northeast, south], ignore_index=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)
|
Exercises
Appending Series with nonunique Indices
The Series bronze and silver, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called Country and the values are the corresponding number of medals won.
If you were to run the command combined = bronze.append(silver), how many rows would combined have? And how many rows would combined.loc['United States'] return? Find out for yourself by running these commands in the IPython Shell.
Instructions
Possible Answers
- combined has 5 rows and combined.loc[‘United States’] is empty (0 rows).
- combined has 10 rows and combined.loc['United States'] has 2 rows.
- combined has 6 rows and combined.loc[‘United States’] has 1 row.
- combined has 5 rows and combined.loc[‘United States’] has 2 rows.
1
2
| bronze = pd.read_csv(so_bronze5_file, index_col=0)
bronze
|
| Total |
---|
Country | |
---|
United States | 1052.0 |
---|
Soviet Union | 584.0 |
---|
United Kingdom | 505.0 |
---|
France | 475.0 |
---|
Germany | 454.0 |
---|
1
2
| silver = pd.read_csv(so_silver5_file, index_col=0)
silver
|
| Total |
---|
Country | |
---|
United States | 1195.0 |
---|
Soviet Union | 627.0 |
---|
United Kingdom | 591.0 |
---|
France | 461.0 |
---|
Italy | 394.0 |
---|
1
2
| combined = pd.concat([bronze, silver])
combined
|
| Total |
---|
Country | |
---|
United States | 1052.0 |
---|
Soviet Union | 584.0 |
---|
United Kingdom | 505.0 |
---|
France | 475.0 |
---|
Germany | 454.0 |
---|
United States | 1195.0 |
---|
Soviet Union | 627.0 |
---|
United Kingdom | 591.0 |
---|
France | 461.0 |
---|
Italy | 394.0 |
---|
1
| combined.loc['United States']
|
| Total |
---|
Country | |
---|
United States | 1052.0 |
---|
United States | 1195.0 |
---|
Appending pandas Series
In this exercise, you’ll load sales data from the months January, February, and March into DataFrames. Then, you’ll extract Series with the 'Units' column from each and append them together with method chaining using .append().
To check that the stacking worked, you’ll print slices from these Series, and finally, you’ll add the result to figure out the total units sold in the first quarter.
Instructions
- Read the files 'sales-jan-2015.csv', 'sales-feb-2015.csv' and 'sales-mar-2015.csv' into the DataFrames jan, feb, and mar respectively.
- Use parse_dates=True and index_col='Date'.
- Extract the 'Units' column of jan, feb, and mar to create the Series jan_units, feb_units, and mar_units respectively.
- Construct the Series quarter1 by appending feb_units to jan_units and then appending mar_units to the result. Use chained calls to the .append() method to do this.
- Verify that quarter1 has the individual Series stacked vertically. To do this:
  - Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
  - Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
- Compute and print the total number of units sold from the Series quarter1. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
| # Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(sales_jan_2015_file, parse_dates=True, index_col='Date')
# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(sales_feb_2015_file, parse_dates=True, index_col='Date')
# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(sales_mar_2015_file, parse_dates=True, index_col='Date')
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']
# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']
# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']
# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = pd.concat([jan_units, feb_units, mar_units])
# Print the first slice from quarter1
display(quarter1.sort_index().loc['jan 27, 2015':'feb 2, 2015'])
# Print the second slice from quarter1
display(quarter1.sort_index().loc['feb 26, 2015':'mar 7, 2015'])
# Compute & print total sales in quarter1
display(quarter1.sum())
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| Date
2015-01-27 07:11:55 18
2015-02-02 08:33:01 3
2015-02-02 20:54:49 9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45 4
2015-02-26 08:58:51 1
2015-03-06 02:03:56 17
2015-03-06 10:11:45 17
Name: Units, dtype: int64
642
|
Concatenating pandas Series along row axis
Having learned how to append Series, you’ll now learn how to achieve the same result by concatenating Series instead. You’ll continue to work with the sales data you’ve seen previously. This time, the DataFrames jan, feb, and mar have been pre-loaded.
Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining calls to .append().
You may be wondering about the difference between pd.concat() and pandas’ .append() method. One way to think of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more flexibility, as you’ll see in later exercises.
Instructions
- Create an empty list called units. This has been done for you.
- Use a for loop to iterate over [jan, feb, mar]:
  - In each iteration of the loop, append the 'Units' column of each DataFrame to units.
- Concatenate the Series contained in the list units into a longer Series called quarter1 using pd.concat().
  - Specify the keyword argument axis='rows' to stack the Series vertically.
- Verify that quarter1 has the individual Series stacked vertically by printing slices. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
8
9
10
11
12
13
| # Initialize empty list: units
units = []
# Build the list of Series
for month in [jan, feb, mar]:
units.append(month.Units)
# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')
# Print slices from quarter1
print(quarter1.sort_index().loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.sort_index().loc['feb 26, 2015':'mar 7, 2015'])
|
1
2
3
4
5
6
7
8
9
10
11
| Date
2015-01-27 07:11:55 18
2015-02-02 08:33:01 3
2015-02-02 20:54:49 9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45 4
2015-02-26 08:58:51 1
2015-03-06 02:03:56 17
2015-03-06 10:11:45 17
Name: Units, dtype: int64
|
1
| del bronze, silver, combined, jan, feb, mar, jan_units, feb_units, mar_units, quarter1
|
Appending & concatenating DataFrames
Loading population data
1
2
3
| import pandas as pd
pop1 = pd.read_csv('population_01.csv', index_col=0)
pop2 = pd.read_csv('population_02.csv', index_col=0)
|
1
2
3
4
5
6
7
| pop1_data = {'Zip Code ZCTA': [66407, 72732, 50579, 46421], '2010 Census Population': [479, 4716, 2405, 30670]}
pop2_data = {'Zip Code ZCTA': [12776, 76092, 98360, 49464], '2010 Census Population': [2180, 26669, 12221, 27481]}
pop1 = pd.DataFrame.from_dict(pop1_data)
pop1.set_index('Zip Code ZCTA', drop=True, inplace=True)
pop2 = pd.DataFrame.from_dict(pop2_data)
pop2.set_index('Zip Code ZCTA', drop=True, inplace=True)
|
Examining population data
| 2010 Census Population |
---|
Zip Code ZCTA | |
---|
66407 | 479 |
---|
72732 | 4716 |
---|
50579 | 2405 |
---|
46421 | 30670 |
---|
| 2010 Census Population |
---|
Zip Code ZCTA | |
---|
12776 | 2180 |
---|
76092 | 26669 |
---|
98360 | 12221 |
---|
49464 | 27481 |
---|
1
2
| print(type(pop1), pop1.shape)
print(type(pop2), pop2.shape)
|
1
2
| <class 'pandas.core.frame.DataFrame'> (4, 1)
<class 'pandas.core.frame.DataFrame'> (4, 1)
|
Appending population DataFrames
1
| pd.concat([pop1, pop2])
|
| 2010 Census Population |
---|
Zip Code ZCTA | |
---|
66407 | 479 |
---|
72732 | 4716 |
---|
50579 | 2405 |
---|
46421 | 30670 |
---|
12776 | 2180 |
---|
76092 | 26669 |
---|
98360 | 12221 |
---|
49464 | 27481 |
---|
1
2
| print(pop1.index.name, pop1.columns)
print(pop2.index.name, pop2.columns)
|
1
2
| Zip Code ZCTA Index(['2010 Census Population'], dtype='object')
Zip Code ZCTA Index(['2010 Census Population'], dtype='object')
|
Population & unemployment data
1
2
| population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
|
1
2
3
4
5
6
7
| pop_data = {'Zip Code ZCTA': [57538, 59916, 37660, 2860], '2010 Census Population': [322, 130, 40038, 45199]}
emp_data = {'Zip': [2860, 46167, 1097, 80808], 'unemployment': [0.11, 0.02, 0.33, 0.07], 'participants': [34447, 4800, 42, 4310]}
population = pd.DataFrame.from_dict(pop_data)
population.set_index('Zip Code ZCTA', drop=True, inplace=True)
unemployment = pd.DataFrame.from_dict(emp_data)
unemployment.set_index('Zip', drop=True, inplace=True)
|
| 2010 Census Population |
---|
Zip Code ZCTA | |
---|
57538 | 322 |
---|
59916 | 130 |
---|
37660 | 40038 |
---|
2860 | 45199 |
---|
| unemployment | participants |
---|
Zip | | |
---|
2860 | 0.11 | 34447 |
---|
46167 | 0.02 | 4800 |
---|
1097 | 0.33 | 42 |
---|
80808 | 0.07 | 4310 |
---|
Appending population & unemployment
1
| pd.concat([population, unemployment], sort=True)
|
| 2010 Census Population | participants | unemployment |
---|
57538 | 322.0 | NaN | NaN |
---|
59916 | 130.0 | NaN | NaN |
---|
37660 | 40038.0 | NaN | NaN |
---|
2860 | 45199.0 | NaN | NaN |
---|
2860 | NaN | 34447.0 | 0.11 |
---|
46167 | NaN | 4800.0 | 0.02 |
---|
1097 | NaN | 42.0 | 0.33 |
---|
80808 | NaN | 4310.0 | 0.07 |
---|
Repeated index labels
1
| pd.concat([population, unemployment], sort=True)
|
| 2010 Census Population | participants | unemployment |
---|
57538 | 322.0 | NaN | NaN |
---|
59916 | 130.0 | NaN | NaN |
---|
37660 | 40038.0 | NaN | NaN |
---|
2860 | 45199.0 | NaN | NaN |
---|
2860 | NaN | 34447.0 | 0.11 |
---|
46167 | NaN | 4800.0 | 0.02 |
---|
1097 | NaN | 42.0 | 0.33 |
---|
80808 | NaN | 4310.0 | 0.07 |
---|
Concatenating rows
- with axis=0, pd.concat is the same as population.append(unemployment, sort=True)
1
| pd.concat([population, unemployment], axis=0, sort=True)
|
| 2010 Census Population | participants | unemployment |
---|
57538 | 322.0 | NaN | NaN |
---|
59916 | 130.0 | NaN | NaN |
---|
37660 | 40038.0 | NaN | NaN |
---|
2860 | 45199.0 | NaN | NaN |
---|
2860 | NaN | 34447.0 | 0.11 |
---|
46167 | NaN | 4800.0 | 0.02 |
---|
1097 | NaN | 42.0 | 0.33 |
---|
80808 | NaN | 4310.0 | 0.07 |
---|
Concatenating column
1
| pd.concat([population, unemployment], axis=1, sort=True)
|
| 2010 Census Population | unemployment | participants |
---|
1097 | NaN | 0.33 | 42.0 |
---|
2860 | 45199.0 | 0.11 | 34447.0 |
---|
37660 | 40038.0 | NaN | NaN |
---|
46167 | NaN | 0.02 | 4800.0 |
---|
57538 | 322.0 | NaN | NaN |
---|
59916 | 130.0 | NaN | NaN |
---|
80808 | NaN | 0.07 | 4310.0 |
---|
1
| del pop1_data, pop2_data, pop1, pop2, pop_data, emp_data, population, unemployment
|
Exercises
Appending DataFrames with ignore_index
In this exercise, you’ll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).
You’ll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original two DataFrames, you’ll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, you’ll specify ignore_index=True so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
Instructions
- Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
- Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
- Print the shapes of all three DataFrames. This has been done for you.
- Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate filter. The relevant column of combined_names here is 'name'.
1
2
| names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name', 'gender', 'count'])
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'])
|
| name | gender | count |
---|
0 | Jennifer | F | 57032 |
---|
1 | Jessica | F | 42519 |
---|
2 | Amanda | F | 34370 |
---|
3 | Sarah | F | 28162 |
---|
4 | Melissa | F | 28003 |
---|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| # Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981
# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = pd.concat([names_1881, names_1981], ignore_index=True, sort=False)
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)
# Print all rows that contain the name 'Morgan'
combined_names[combined_names.name == 'Morgan']
|
1
2
3
| (19455, 4)
(1935, 4)
(21390, 4)
|
| name | gender | count | year |
---|
1283 | Morgan | M | 23 | 1881 |
---|
2096 | Morgan | F | 1769 | 1981 |
---|
14390 | Morgan | M | 766 | 1981 |
---|
Concatenating pandas DataFrames along column axis
The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.
In this exercise, you’ll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You’ll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises); a small sketch contrasting outer and inner joins follows below.
The files 'quarterly_max_temp.csv' and 'monthly_mean_temp.csv' have been pre-loaded into the DataFrames weather_max and weather_mean respectively, and pandas has been imported as pd.
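A minimal sketch of the outer-versus-inner distinction for axis=1 concatenation, using made-up quarterly and monthly Series (values are illustrative only):
| import pandas as pd

quarterly_max = pd.Series([68, 89], index=['Jan', 'Apr'], name='Max TemperatureF')
monthly_mean = pd.Series([32.4, 28.7, 35.0, 53.1], index=['Jan', 'Feb', 'Mar', 'Apr'], name='Mean TemperatureF')

# Default join='outer': union of the row labels, NaN where a Series has no value
print(pd.concat([quarterly_max, monthly_mean], axis=1))

# join='inner': intersection of the row labels, so only Jan and Apr survive
print(pd.concat([quarterly_max, monthly_mean], axis=1, join='inner'))
|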
Instructions
- Create a new DataFrame called weather by concatenating the DataFrames weather_max and weather_mean horizontally.
  - Pass the DataFrames to pd.concat() as a list and specify the keyword argument axis=1 to stack them horizontally.
- Print the new DataFrame weather.
1
2
3
4
5
6
7
8
| weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}
weather_mean = pd.DataFrame.from_dict(weather_mean_data)
weather_mean.set_index('Month', inplace=True, drop=True)
weather_max = pd.DataFrame.from_dict(weather_max_data)
weather_max.set_index('Month', inplace=True, drop=True)
|
| Max TemperatureF |
---|
Month | |
---|
Jan | 68 |
---|
Apr | 89 |
---|
Jul | 91 |
---|
Oct | 84 |
---|
| Mean TemperatureF |
---|
Month | |
---|
Apr | 53.100000 |
---|
Aug | 70.000000 |
---|
Dec | 34.935484 |
---|
Feb | 28.714286 |
---|
Jan | 32.354839 |
---|
Jul | 72.870968 |
---|
Jun | 70.133333 |
---|
Mar | 35.000000 |
---|
May | 62.612903 |
---|
Nov | 39.800000 |
---|
Oct | 55.451613 |
---|
Sep | 63.766667 |
---|
1
2
3
4
5
| # Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1, sort=True)
# Print weather
weather
|
| Max TemperatureF | Mean TemperatureF |
---|
Month | | |
---|
Apr | 89.0 | 53.100000 |
---|
Aug | NaN | 70.000000 |
---|
Dec | NaN | 34.935484 |
---|
Feb | NaN | 28.714286 |
---|
Jan | 68.0 | 32.354839 |
---|
Jul | 91.0 | 72.870968 |
---|
Jun | NaN | 70.133333 |
---|
Mar | NaN | 35.000000 |
---|
May | NaN | 62.612903 |
---|
Nov | NaN | 39.800000 |
---|
Oct | 84.0 | 55.451613 |
---|
Sep | NaN | 63.766667 |
---|
Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You’ll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.
Here, you’ll work with DataFrames compiled from The Guardian’s Olympic medal dataset.
pandas has been imported as pd and two lists have been pre-loaded: an empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.
Instructions
- Iterate over medal_types in the for loop.
- Inside the for loop:
  - Create file_name using string interpolation with the loop variable medal. This has been done for you. The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
  - Create the list of column names called columns. This has been done for you.
  - Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country', and names=columns to get the correct row and column Indexes.
  - Append medal_df to medals using the list .append() method.
- Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame called medals. Print it in its entirety.
1
2
3
| top_five = data.glob('*_top5.csv')
for file in top_five:
print(file)
|
1
2
3
| D:\users\trenton\Dropbox\PythonProjects\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_bronze_top5.csv
D:\users\trenton\Dropbox\PythonProjects\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_gold_top5.csv
D:\users\trenton\Dropbox\PythonProjects\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_silver_top5.csv
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
| medal_types = ['bronze', 'silver', 'gold']
medal_list = list()
for medal in medal_types:
# Create the file name: file_name
file_name = data / f'summer_olympics_{medal}_top5.csv'
# Create list of column names: columns
columns = ['Country', medal]
# Read file_name into a DataFrame: df
medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)
# Append medal_df to medals
medal_list.append(medal_df)
# Concatenate medals horizontally: medals
medals = pd.concat(medal_list, axis='columns', sort=True)
# Print medals
medals
|
| bronze | silver | gold |
---|
Country | | | |
---|
France | 475.0 | 461.0 | NaN |
---|
Germany | 454.0 | NaN | 407.0 |
---|
Italy | NaN | 394.0 | 460.0 |
---|
Soviet Union | 584.0 | 627.0 | 838.0 |
---|
United Kingdom | 505.0 | 591.0 | 498.0 |
---|
United States | 1052.0 | 1195.0 | 2088.0 |
---|
1
| del names_1881, names_1981, combined_names, weather_mean_data, weather_max_data, weather_mean, weather_max, weather, top_five, medals, medal_list
|
Concatenation, keys & MultiIndexes
Loading rainfall data
1
2
3
4
5
| import pandas as pd
file1 = 'q1_rainfall_2013.csv'
rain2013 = pd.read_csv(file1, index_col='Month', parse_dates=True)
file2 = 'q1_rainfall_2014.csv'
rain2014 = pd.read_csv(file2, index_col='Month', parse_dates=True)
|
1
2
3
4
5
6
7
| rain_2013_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.096129, 0.067143, 0.061613]}
rain_2014_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.050323, 0.082143, 0.070968]}
rain2013 = pd.DataFrame.from_dict(rain_2013_data)
rain2013.set_index('Month', inplace=True)
rain2014 = pd.DataFrame.from_dict(rain_2014_data)
rain2014.set_index('Month', inplace=True)
|
Examining rainfall data
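The two tables below are the rain2013 and rain2014 DataFrames built above; the cell that rendered them is not shown, so here is a minimal sketch (assuming the display() helper used later in this document):
1
2
| display(rain2013)
display(rain2014)
|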
| Precipitation |
---|
Month | |
---|
Jan | 0.096129 |
---|
Feb | 0.067143 |
---|
Mar | 0.061613 |
---|
| Precipitation |
---|
Month | |
---|
Jan | 0.050323 |
---|
Feb | 0.082143 |
---|
Mar | 0.070968 |
---|
Concatenating rows
1
| pd.concat([rain2013, rain2014], axis=0)
|
| Precipitation |
---|
Month | |
---|
Jan | 0.096129 |
---|
Feb | 0.067143 |
---|
Mar | 0.061613 |
---|
Jan | 0.050323 |
---|
Feb | 0.082143 |
---|
Mar | 0.070968 |
---|
Using multi-index on rows
1
2
| rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
rain1314
|
| | Precipitation |
---|
| Month | |
---|
2013 | Jan | 0.096129 |
---|
Feb | 0.067143 |
---|
Mar | 0.061613 |
---|
2014 | Jan | 0.050323 |
---|
Feb | 0.082143 |
---|
Mar | 0.070968 |
---|
Accessing a multi-index
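The table below contains only the 2014 rows of rain1314, so the missing cell presumably selected the outer level of the MultiIndex; a minimal sketch (an assumption, using rain1314 from the cell above):
1
| rain1314.loc[2014]
|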
| Precipitation |
---|
Month | |
---|
Jan | 0.050323 |
---|
Feb | 0.082143 |
---|
Mar | 0.070968 |
---|
Concatenating columns
1
2
| rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314
|
| Precipitation | Precipitation |
---|
Month | | |
---|
Jan | 0.096129 | 0.050323 |
---|
Feb | 0.067143 | 0.082143 |
---|
Mar | 0.061613 | 0.070968 |
---|
Using a multi-index on columns
1
2
| rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
rain1314
|
| 2013 | 2014 |
---|
| Precipitation | Precipitation |
---|
Month | | |
---|
Jan | 0.096129 | 0.050323 |
---|
Feb | 0.067143 | 0.082143 |
---|
Mar | 0.061613 | 0.070968 |
---|
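The next table shows only the 2013 columns; the cell that produced it presumably selected the outer column level, for example (an assumption):
1
| rain1314[2013]
|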
| Precipitation |
---|
Month | |
---|
Jan | 0.096129 |
---|
Feb | 0.067143 |
---|
Mar | 0.061613 |
---|
pd.concat() with dict
1
2
3
| rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis='columns')
rain1314
|
| 2013 | 2014 |
---|
| Precipitation | Precipitation |
---|
Month | | |
---|
Jan | 0.096129 | 0.050323 |
---|
Feb | 0.067143 | 0.082143 |
---|
Mar | 0.061613 | 0.070968 |
---|
1
| del rain_2013_data, rain_2014_data, rain2013, rain2014, rain1314
|
Exercises
Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys
parameter in the call to pd.concat()
, which generates a hierarchical index with the labels from keys
as the outermost index label. So you don’t have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.
Here, you’ll continue working with DataFrames compiled from The Guardian’s Olympic medal dataset. Once again, pandas
has been imported as pd
and two lists have been pre-loaded: An empty list called medals
, and medal_types
, which contains the strings 'bronze'
, 'silver'
, and 'gold'
.
Instructions
- Within the
for
loop:- Read
file_name
into a DataFrame called medal_df
. Specify the index to be 'Country'
. - Append
medal_df
to medals
.
- Concatenate the list of DataFrames
medals
into a single DataFrame called medals
. Be sure to use the keyword argument keys=['bronze', 'silver', 'gold']
to create a vertically stacked DataFrame with a MultiIndex. - Print the new DataFrame
medals
. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
| medal_types = ['bronze', 'silver', 'gold']
medal_list = list()
for medal in medal_types:
# Create the file name: file_name
file_name = data / f'summer_olympics_{medal}_top5.csv'
# Read file_name into a DataFrame: medal_df
medal_df = pd.read_csv(file_name, index_col='Country')
# Append medal_df to medals
medal_list.append(medal_df)
# Concatenate medals: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'])
# Print medals in entirety
print(medals)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| Total
Country
bronze United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
silver United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
gold United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0
|
Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using The Guardian’s Olympic medal dataset).
You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use the pd.IndexSlice
to extract specific slices. Check out this exercise from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.
pandas
has been imported for you as pd
and the DataFrame medals
is already in your namespace.
Instructions
- Create a new DataFrame
medals_sorted
with the entries of medals
sorted. Use .sort_index(level=0)
to ensure the Index is sorted suitably. - Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
- Create an alias for
pd.IndexSlice
called idx
. A slicer pd.IndexSlice
is required when slicing on the inner level of a MultiIndex. - Slice all the data on medals won by the United Kingdom. To do this, use the
.loc[]
accessor with idx[:,'United Kingdom'], :
.
1
2
3
4
5
| # Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)
# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])
|
1
2
| Total 454.0
Name: (bronze, Germany), dtype: float64
|
1
2
| # Print data about silver medals
print(medals_sorted.loc['silver'])
|
1
2
3
4
5
6
7
| Total
Country
France 461.0
Italy 394.0
Soviet Union 627.0
United Kingdom 591.0
United States 1195.0
|
1
2
3
4
5
| # Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice
# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:, 'United Kingdom'], :]
|
| | Total |
---|
| Country | |
---|
bronze | United Kingdom | 505.0 |
---|
gold | United Kingdom | 498.0 |
---|
silver | United Kingdom | 591.0 |
---|
Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you’ll start with pandas imported and a list of three DataFrames called dataframes
. All three DataFrames contain 'Company'
, 'Product'
, and 'Units'
columns with a 'Date'
column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes Hardware
transactions, the second describes Software
transactions, and the third, Service
transactions.
Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.
Instructions
- Construct a new DataFrame
february
with MultiIndexed columns by concatenating the list dataframes
. - Use
axis=1
to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service']
to construct a hierarchical Index from each DataFrame. - Print summary information from the new DataFrame
february
using the .info()
method. This has been done for you. - Create an alias called
idx
for pd.IndexSlice
. - Extract a slice called
slice_2_8
from february
(using .loc[]
& idx
) that comprises rows between Feb. 2, 2015 to Feb. 8, 2015 from columns under 'Company'
. - Print the
slice_2_8
. This has been done for you, so hit ‘Submit Answer’ to see the sliced data!
1
2
3
4
5
6
| hw = pd.read_csv(sales_feb_hardware_file, index_col='Date')
sw = pd.read_csv(sales_feb_software_file, index_col='Date')
sv = pd.read_csv(sales_feb_service_file, index_col='Date')
dataframes = [hw, sw, sv]
dataframes
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
| [ Company Product Units
Date
2015-02-04 21:52:45 Acme Coporation Hardware 14
2015-02-07 22:58:10 Acme Coporation Hardware 1
2015-02-19 10:59:33 Mediacore Hardware 16
2015-02-02 20:54:49 Mediacore Hardware 9
2015-02-21 20:41:47 Hooli Hardware 3,
Company Product Units
Date
2015-02-16 12:09:19 Hooli Software 10
2015-02-03 14:14:18 Initech Software 13
2015-02-02 08:33:01 Hooli Software 3
2015-02-05 01:53:06 Acme Coporation Software 19
2015-02-11 20:03:08 Initech Software 7
2015-02-09 13:09:55 Mediacore Software 7
2015-02-11 22:50:44 Hooli Software 4
2015-02-04 15:36:29 Streeplex Software 13
2015-02-21 05:01:26 Mediacore Software 3,
Company Product Units
Date
2015-02-26 08:57:45 Streeplex Service 4
2015-02-25 00:29:00 Initech Service 10
2015-02-09 08:57:30 Streeplex Service 19
2015-02-26 08:58:51 Streeplex Service 1
2015-02-05 22:05:03 Hooli Service 10
2015-02-19 16:02:58 Mediacore Service 10]
|
1
2
3
4
5
| # Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'], sort=True)
# Print february.info()
february.info()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| <class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Hardware, Company) 5 non-null object
1 (Hardware, Product) 5 non-null object
2 (Hardware, Units) 5 non-null float64
3 (Software, Company) 9 non-null object
4 (Software, Product) 9 non-null object
5 (Software, Units) 9 non-null float64
6 (Service, Company) 6 non-null object
7 (Service, Product) 6 non-null object
8 (Service, Units) 6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.6+ KB
|
| Hardware | Software | Service |
---|
| Company | Product | Units | Company | Product | Units | Company | Product | Units |
---|
Date | | | | | | | | | |
---|
2015-02-02 08:33:01 | NaN | NaN | NaN | Hooli | Software | 3.0 | NaN | NaN | NaN |
---|
2015-02-02 20:54:49 | Mediacore | Hardware | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2015-02-03 14:14:18 | NaN | NaN | NaN | Initech | Software | 13.0 | NaN | NaN | NaN |
---|
2015-02-04 15:36:29 | NaN | NaN | NaN | Streeplex | Software | 13.0 | NaN | NaN | NaN |
---|
2015-02-04 21:52:45 | Acme Coporation | Hardware | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2015-02-05 01:53:06 | NaN | NaN | NaN | Acme Coporation | Software | 19.0 | NaN | NaN | NaN |
---|
2015-02-05 22:05:03 | NaN | NaN | NaN | NaN | NaN | NaN | Hooli | Service | 10.0 |
---|
2015-02-07 22:58:10 | Acme Coporation | Hardware | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2015-02-09 08:57:30 | NaN | NaN | NaN | NaN | NaN | NaN | Streeplex | Service | 19.0 |
---|
2015-02-09 13:09:55 | NaN | NaN | NaN | Mediacore | Software | 7.0 | NaN | NaN | NaN |
---|
2015-02-11 20:03:08 | NaN | NaN | NaN | Initech | Software | 7.0 | NaN | NaN | NaN |
---|
2015-02-11 22:50:44 | NaN | NaN | NaN | Hooli | Software | 4.0 | NaN | NaN | NaN |
---|
2015-02-16 12:09:19 | NaN | NaN | NaN | Hooli | Software | 10.0 | NaN | NaN | NaN |
---|
2015-02-19 10:59:33 | Mediacore | Hardware | 16.0 | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2015-02-19 16:02:58 | NaN | NaN | NaN | NaN | NaN | NaN | Mediacore | Service | 10.0 |
---|
2015-02-21 05:01:26 | NaN | NaN | NaN | Mediacore | Software | 3.0 | NaN | NaN | NaN |
---|
2015-02-21 20:41:47 | Hooli | Hardware | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
---|
2015-02-25 00:29:00 | NaN | NaN | NaN | NaN | NaN | NaN | Initech | Service | 10.0 |
---|
2015-02-26 08:57:45 | NaN | NaN | NaN | NaN | NaN | NaN | Streeplex | Service | 4.0 |
---|
2015-02-26 08:58:51 | NaN | NaN | NaN | NaN | NaN | NaN | Streeplex | Service | 1.0 |
---|
1
2
3
4
5
6
7
8
| # Assign pd.IndexSlice: idx
idx = pd.IndexSlice
# Create the slice: slice_2_8
slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]
# Print slice_2_8
slice_2_8
|
| Hardware | Software | Service |
---|
| Company | Company | Company |
---|
Date | | | |
---|
2015-02-02 08:33:01 | NaN | Hooli | NaN |
---|
2015-02-02 20:54:49 | Mediacore | NaN | NaN |
---|
2015-02-03 14:14:18 | NaN | Initech | NaN |
---|
2015-02-04 15:36:29 | NaN | Streeplex | NaN |
---|
2015-02-04 21:52:45 | Acme Coporation | NaN | NaN |
---|
2015-02-05 01:53:06 | NaN | Acme Coporation | NaN |
---|
2015-02-05 22:05:03 | NaN | NaN | Hooli |
---|
2015-02-07 22:58:10 | Acme Coporation | NaN | NaN |
---|
Concatenating DataFrames from a dict
You’re now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan
, feb
, and mar
have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company'
column into a single DataFrame. You’ll do this by constructing a dictionary of these DataFrames and then concatenating them.
Instructions
- Create a list called
month_list
consisting of the tuples ('january', jan)
, ('february', feb)
, and ('march', mar)
. - Create an empty dictionary called
month_dict
. - Inside the
for
loop:- Group
month_data
by 'Company'
and use .sum()
to aggregate.
- Construct a new DataFrame called
sales
by concatenating the DataFrames stored in month_dict
. - Create an alias for
pd.IndexSlice
and print all sales by 'Mediacore'
. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
| jan = pd.read_csv(sales_jan_2015_file)
feb = pd.read_csv(sales_feb_2015_file)
mar = pd.read_csv(sales_mar_2015_file)
|
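The table below is the mar DataFrame (all dates fall in March 2015); the displaying cell is not shown, so this is a minimal sketch:
1
| mar
|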
| Date | Company | Product | Units |
---|
0 | 2015-03-22 14:42:25 | Mediacore | Software | 6 |
---|
1 | 2015-03-12 18:33:06 | Initech | Service | 19 |
---|
2 | 2015-03-22 03:58:28 | Streeplex | Software | 8 |
---|
3 | 2015-03-15 00:53:12 | Hooli | Hardware | 19 |
---|
4 | 2015-03-17 19:25:37 | Hooli | Hardware | 10 |
---|
5 | 2015-03-16 05:54:06 | Mediacore | Software | 3 |
---|
6 | 2015-03-25 10:18:10 | Initech | Hardware | 9 |
---|
7 | 2015-03-25 16:42:42 | Streeplex | Hardware | 12 |
---|
8 | 2015-03-26 05:20:04 | Streeplex | Software | 3 |
---|
9 | 2015-03-06 10:11:45 | Mediacore | Software | 17 |
---|
10 | 2015-03-22 21:14:39 | Initech | Hardware | 11 |
---|
11 | 2015-03-17 19:38:12 | Hooli | Hardware | 8 |
---|
12 | 2015-03-28 19:20:38 | Acme Coporation | Service | 5 |
---|
13 | 2015-03-13 04:41:32 | Streeplex | Hardware | 8 |
---|
14 | 2015-03-06 02:03:56 | Mediacore | Software | 17 |
---|
15 | 2015-03-13 11:40:16 | Initech | Software | 11 |
---|
16 | 2015-03-27 08:29:45 | Mediacore | Software | 6 |
---|
17 | 2015-03-21 06:42:41 | Mediacore | Hardware | 19 |
---|
18 | 2015-03-15 08:50:45 | Initech | Hardware | 18 |
---|
19 | 2015-03-13 16:25:24 | Streeplex | Software | 9 |
---|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
| # Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]
# Create an empty dictionary: month_dict
month_dict = dict()
for month_name, month_data in month_list:
# Group month_data: month_dict[month_name]
month_dict[month_name] = month_data.groupby(['Company']).sum()
# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)
# Print sales
display(sales)
# Print all sales by Mediacore
idx = pd.IndexSlice
display(sales.loc[idx[:, 'Mediacore'], :])
|
| | Date | Product | Units |
---|
| Company | | | |
---|
january | Acme Coporation | 2015-01-01 07:31:202015-01-20 19:49:242015-01-... | SoftwareHardwareSoftwareServiceSoftware | 76 |
---|
Hooli | 2015-01-02 09:51:062015-01-11 14:51:022015-01-... | HardwareHardwareServiceServiceHardware | 70 |
---|
Initech | 2015-01-06 17:19:342015-01-24 08:01:162015-01-... | HardwareSoftwareServiceService | 37 |
---|
Mediacore | 2015-01-15 15:33:402015-01-16 19:20:46 | HardwareService | 15 |
---|
Streeplex | 2015-01-21 19:13:212015-01-09 05:23:512015-01-... | HardwareServiceServiceSoftware | 50 |
---|
february | Acme Coporation | 2015-02-05 01:53:062015-02-04 21:52:452015-02-... | SoftwareHardwareHardware | 34 |
---|
Hooli | 2015-02-16 12:09:192015-02-02 08:33:012015-02-... | SoftwareSoftwareSoftwareServiceHardware | 30 |
---|
Initech | 2015-02-03 14:14:182015-02-25 00:29:002015-02-... | SoftwareServiceSoftware | 30 |
---|
Mediacore | 2015-02-09 13:09:552015-02-19 16:02:582015-02-... | SoftwareServiceHardwareHardwareSoftware | 45 |
---|
Streeplex | 2015-02-26 08:57:452015-02-09 08:57:302015-02-... | ServiceServiceServiceSoftware | 37 |
---|
march | Acme Coporation | 2015-03-28 19:20:38 | Service | 5 |
---|
Hooli | 2015-03-15 00:53:122015-03-17 19:25:372015-03-... | HardwareHardwareHardware | 37 |
---|
Initech | 2015-03-12 18:33:062015-03-25 10:18:102015-03-... | ServiceHardwareHardwareSoftwareHardware | 68 |
---|
Mediacore | 2015-03-22 14:42:252015-03-16 05:54:062015-03-... | SoftwareSoftwareSoftwareSoftwareSoftwareHardware | 68 |
---|
Streeplex | 2015-03-22 03:58:282015-03-25 16:42:422015-03-... | SoftwareHardwareSoftwareHardwareSoftware | 40 |
---|
| | Date | Product | Units |
---|
| Company | | | |
---|
january | Mediacore | 2015-01-15 15:33:402015-01-16 19:20:46 | HardwareService | 15 |
---|
february | Mediacore | 2015-02-09 13:09:552015-02-19 16:02:582015-02-... | SoftwareServiceHardwareHardwareSoftware | 45 |
---|
march | Mediacore | 2015-03-22 14:42:252015-03-16 05:54:062015-03-... | SoftwareSoftwareSoftwareSoftwareSoftwareHardware | 68 |
---|
1
| del medal_types, medal_list, medal_df, medals, medals_sorted, idx, hw, sw, sv, dataframes, february, slice_2_8
|
Outer & inner joins
Using np.concatenate() with arrays
1
2
| A = np.arange(8).reshape(2, 4) + 0.1
A
|
1
2
| array([[0.1, 1.1, 2.1, 3.1],
[4.1, 5.1, 6.1, 7.1]])
|
1
2
| B = np.arange(6).reshape(2,3) + 0.2
B
|
1
2
| array([[0.2, 1.2, 2.2],
[3.2, 4.2, 5.2]])
|
1
2
| C = np.arange(12).reshape(3,4) + 0.3
C
|
1
2
3
| array([[ 0.3, 1.3, 2.3, 3.3],
[ 4.3, 5.3, 6.3, 7.3],
[ 8.3, 9.3, 10.3, 11.3]])
|
Stacking arrays horizontally
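The output directly below has no code cell attached; it matches np.hstack([B, A]), which is equivalent to the np.concatenate() call shown after it. A minimal sketch, assuming the missing cell used np.hstack():
1
| np.hstack([B, A])
|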
1
2
| array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
[3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])
|
1
| np.concatenate([B, A], axis=1)
|
1
2
| array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
[3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])
|
Stacking arrays vertically
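Likewise, the output below matches np.vstack([A, C]), which is equivalent to np.concatenate() along axis=0; a minimal sketch, assuming the missing cell used np.vstack():
1
| np.vstack([A, C])
|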
1
2
3
4
5
| array([[ 0.1, 1.1, 2.1, 3.1],
[ 4.1, 5.1, 6.1, 7.1],
[ 0.3, 1.3, 2.3, 3.3],
[ 4.3, 5.3, 6.3, 7.3],
[ 8.3, 9.3, 10.3, 11.3]])
|
1
| np.concatenate([A, C], axis=0)
|
1
2
3
4
5
| array([[ 0.1, 1.1, 2.1, 3.1],
[ 4.1, 5.1, 6.1, 7.1],
[ 0.3, 1.3, 2.3, 3.3],
[ 4.3, 5.3, 6.3, 7.3],
[ 8.3, 9.3, 10.3, 11.3]])
|
Incompatible array dimensions
1
| np.concatenate([A, B], axis=0) # incompatible columns
|
1
2
3
4
5
6
7
8
9
| ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[92], line 1
----> 1 np.concatenate([A, B], axis=0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 4 and the array at index 1 has size 3
|
1
| np.concatenate([A, C], axis=1) # incompatible rows
|
1
2
3
4
5
6
7
8
9
| ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[93], line 1
----> 1 np.concatenate([A, C], axis=1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3
|
Population & unemployment data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
print(population)
2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199
print(unemployment)
unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 42
80808 0.07 4310
|
Converting to arrays
1
2
3
4
5
6
7
8
9
10
11
12
13
| population_array = np.array(population)
print(population_array) # Index info is lost
[[ 322]
[ 130]
[40038]
[45199]]
unemployment_array = np.array(unemployment)
print(unemployment_array)
[[ 1.10000000e-01 3.44470000e+04]
[ 2.00000000e-02 4.80000000e+03]
[ 3.30000000e-01 4.20000000e+01]
[ 7.00000000e-02 4.31000000e+03]]
|
Manipulating data as arrays
1
2
3
4
5
| print(np.concatenate([population_array, unemployment_array], axis=1))
[[ 3.22000000e+02 1.10000000e-01 3.44470000e+04]
[ 1.30000000e+02 2.00000000e-02 4.80000000e+03]
[ 4.00380000e+04 3.30000000e-01 4.20000000e+01]
[ 4.51990000e+04 7.00000000e-02 4.31000000e+03]]
|
Joins
- Joining tables: Combining rows of multiple tables
- Outer join
- Union of index sets (all labels, no repetition)
- Missing fields filled with NaN
- Preserves the indices in the original tables, filling null values for missing rows
- Has all the indices of the original tables without repetition (like a set union)
- Inner join
- Intersection of index sets (only common labels)
- Has only labels common to both tables (like a set intersection)
Concatenation & inner join
- only the row label present in both DataFrames is preserved
1
2
3
4
| pd.concat([population, unemployment], axis=1, join='inner')
2010 Census Population unemployment participants
2860 45199 0.11 34447
|
Concatenation & outer join
- All row indices from the original two Indexes exist in the joined DataFrame index.
- When a row occurs in one DataFrame, but not in the other, the missing column entries are filled with null values
1
2
3
4
5
6
7
8
9
10
| pd.concat([population, unemployment], axis=1, join='outer')
2010 Census Population unemployment participants
1097 NaN 0.33 42.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
80808 NaN 0.07 4310.0
|
Inner join on other axis
- The resulting DataFrame is empty because no column index label appears in both population and unemployment
1
2
3
4
5
| pd.concat([population, unemployment], join='inner', axis=0)
Empty DataFrame
Columns: []
Index: [2860, 46167, 1097, 80808, 57538, 59916, 37660, 2860]
|
Exercises
Concatenating DataFrames with inner join
Here, you’ll continue working with DataFrames compiled from The Guardian’s Olympic medal dataset.
The DataFrames bronze
, silver
, and gold
have been pre-loaded for you.
Your task is to compute an inner join.
Instructions
- Construct a list of DataFrames called
medal_list
with entries bronze
, silver
, and gold
. - Concatenate
medal_list
horizontally with an inner join to create medals
.- Use the keyword argument
keys=['bronze', 'silver', 'gold']
to yield suitable hierarchical indexing. - Use
axis=1
to get horizontal concatenation. - Use
join='inner'
to keep only rows that share common index labels.
- Print the new DataFrame
medals
.
1
2
3
| bronze = pd.read_csv(so_bronze5_file)
silver = pd.read_csv(so_silver_file)
gold = pd.read_csv(so_gold_file)
|
1
2
3
4
5
6
7
8
| # Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]
# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')
# Print medals
medals
|
| bronze | silver | gold |
---|
| Country | Total | NOC | Country | Total | NOC | Country | Total |
---|
0 | United States | 1052.0 | USA | United States | 1195.0 | USA | United States | 2088.0 |
---|
1 | Soviet Union | 584.0 | URS | Soviet Union | 627.0 | URS | Soviet Union | 838.0 |
---|
2 | United Kingdom | 505.0 | GBR | United Kingdom | 591.0 | GBR | United Kingdom | 498.0 |
---|
3 | France | 475.0 | FRA | France | 461.0 | FRA | France | 378.0 |
---|
4 | Germany | 454.0 | GER | Germany | 350.0 | GER | Germany | 407.0 |
---|
Resampling & concatenating DataFrames with inner join
In this exercise, you’ll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.
You’ll need to use a combination of resampling and an inner join to align the index labels. You’ll need an appropriate offset alias for resampling, and the method .resample()
must be chained with some kind of aggregation method (.pct_change()
and .last()
in this case).
pandas
has been imported as pd
, and the DataFrames china
and us
have been pre-loaded, with the output of china.head()
and us.head()
printed in the IPython Shell.
Instructions
- Make a new DataFrame
china_annual
by resampling the DataFrame china
with .resample('A').last()
(i.e., with annual frequency) and chaining two method calls: - Chain
.pct_change(10)
as an aggregation method to compute the percentage change with an offset of ten years. - Chain
.dropna()
to eliminate rows containing null values. - Make a new DataFrame
us_annual
by resampling the DataFrame us
exactly as you resampled china
. - Concatenate
china_annual
and us_annual
to construct a DataFrame called gdp
. Use join='inner'
to perform an inner join and use axis=1
to concatenate horizontally. - Print the result of resampling
gdp
every decade (i.e., using .resample('10A')
) and aggregating with the method .last()
. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
| china = pd.read_csv(gdp_china_file, parse_dates=['Year'])
china.rename(columns={'GDP': 'China'}, inplace=True)
china.set_index('Year', inplace=True)
us = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
us.rename(columns={'DATE': 'Year', 'VALUE': 'US'}, inplace=True)
us.set_index('Year', inplace=True)
|
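The two tables below are china.head() and us.head(), as the exercise text notes; the displaying cell is not shown, so here is a minimal sketch using the display() helper:
1
2
| display(china.head())
display(us.head())
|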
| China |
---|
Year | |
---|
1960-01-01 | 59.184116 |
---|
1961-01-01 | 49.557050 |
---|
1962-01-01 | 46.685179 |
---|
1963-01-01 | 50.097303 |
---|
1964-01-01 | 59.062255 |
---|
| US |
---|
Year | |
---|
1947-01-01 | 243.1 |
---|
1947-04-01 | 246.3 |
---|
1947-07-01 | 250.1 |
---|
1947-10-01 | 260.3 |
---|
1948-01-01 | 266.2 |
---|
1
2
3
| # Resample and tidy china: china_annual
china_annual = china.resample('YE').last().pct_change(10).dropna()
china_annual.head()
|
| China |
---|
Year | |
---|
1970-12-31 | 0.546128 |
---|
1971-12-31 | 0.988860 |
---|
1972-12-31 | 1.402472 |
---|
1973-12-31 | 1.730085 |
---|
1974-12-31 | 1.408556 |
---|
1
2
3
| # Resample and tidy us: us_annual
us_annual = us.resample('YE').last().pct_change(10).dropna()
us_annual.head()
|
| US |
---|
Year | |
---|
1957-12-31 | 0.827507 |
---|
1958-12-31 | 0.782686 |
---|
1959-12-31 | 0.953137 |
---|
1960-12-31 | 0.689354 |
---|
1961-12-31 | 0.630959 |
---|
1
2
3
4
5
| # Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)
# Resample gdp and print
gdp.resample('10YE').last()
|
| China | US |
---|
Year | | |
---|
1970-12-31 | 0.546128 | 1.017187 |
---|
1980-12-31 | 1.072537 | 1.742556 |
---|
1990-12-31 | 0.892820 | 1.012126 |
---|
2000-12-31 | 2.357522 | 0.738632 |
---|
2010-12-31 | 4.011081 | 0.454332 |
---|
2020-12-31 | 3.789936 | 0.361780 |
---|
1
| del bronze, silver, gold, medal_list, medals, china, us, china_annual, us_annual, gdp
|
Merging Data
Here, you’ll learn all about merging pandas DataFrames. You’ll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You’ll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.
merge()
extends concat()
with the ability to align rows using multiple columns
Merging DataFrames
1
2
3
4
5
6
| pa_zipcode_population = {'Zipcode': [16855, 15681, 18657, 17307, 15635],
'2010 Census Population': [282, 5241, 11985, 5899, 220]}
pa_zipcode_city = {'Zipcode': [17545,18455, 17307, 15705, 16833, 16220, 18618, 16855, 16623, 15635, 15681, 18657, 15279, 17231, 18821],
'City': ['MANHEIM', 'PRESTON PARK', 'BIGLERVILLE', 'INDIANA', 'CURWENSVILLE', 'CROWN', 'HARVEYS LAKE', 'MINERAL SPRINGS',
'CASSVILLE', 'HANNASTOWN', 'SALTSBURG', 'TUNKHANNOCK', 'PITTSBURG', 'LEMASTERS', 'GREAT BEND'],
'State': ['PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA']}
|
Population DataFrame
1
2
| population = pd.DataFrame.from_dict(pa_zipcode_population)
population
|
| Zipcode | 2010 Census Population |
---|
0 | 16855 | 282 |
---|
1 | 15681 | 5241 |
---|
2 | 18657 | 11985 |
---|
3 | 17307 | 5899 |
---|
4 | 15635 | 220 |
---|
Cities DataFrame
1
2
| cities = pd.DataFrame.from_dict(pa_zipcode_city)
cities
|
| Zipcode | City | State |
---|
0 | 17545 | MANHEIM | PA |
---|
1 | 18455 | PRESTON PARK | PA |
---|
2 | 17307 | BIGLERVILLE | PA |
---|
3 | 15705 | INDIANA | PA |
---|
4 | 16833 | CURWENSVILLE | PA |
---|
5 | 16220 | CROWN | PA |
---|
6 | 18618 | HARVEYS LAKE | PA |
---|
7 | 16855 | MINERAL SPRINGS | PA |
---|
8 | 16623 | CASSVILLE | PA |
---|
9 | 15635 | HANNASTOWN | PA |
---|
10 | 15681 | SALTSBURG | PA |
---|
11 | 18657 | TUNKHANNOCK | PA |
---|
12 | 15279 | PITTSBURG | PA |
---|
13 | 17231 | LEMASTERS | PA |
---|
14 | 18821 | GREAT BEND | PA |
---|
Merging
pd.merge()
computes a merge on ALL columns that occur in both DataFrames - in the following case, the common column is Zipcode
- for any row in which the Zipcode entry in cities matches a row in population, a new row is made in the merged DataFrame.
- by default, this is an inner join
- it’s an inner join because it glues together only rows that match in the joining columns of BOTH DataFrames
1
| pd.merge(population, cities)
|
| Zipcode | 2010 Census Population | City | State |
---|
0 | 16855 | 282 | MINERAL SPRINGS | PA |
---|
1 | 15681 | 5241 | SALTSBURG | PA |
---|
2 | 18657 | 11985 | TUNKHANNOCK | PA |
---|
3 | 17307 | 5899 | BIGLERVILLE | PA |
---|
4 | 15635 | 220 | HANNASTOWN | PA |
---|
Medal DataFrames
1
2
| bronze = pd.read_csv(so_bronze_file)
bronze.head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 1052.0 |
---|
1 | URS | Soviet Union | 584.0 |
---|
2 | GBR | United Kingdom | 505.0 |
---|
3 | FRA | France | 475.0 |
---|
4 | GER | Germany | 454.0 |
---|
1
2
| gold = pd.read_csv(so_gold_file)
gold.head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 2088.0 |
---|
1 | URS | Soviet Union | 838.0 |
---|
2 | GBR | United Kingdom | 498.0 |
---|
3 | FRA | France | 378.0 |
---|
4 | GER | Germany | 407.0 |
---|
Merging all columns
- by default,
pd.merge()
uses all columns common to both DataFrames to merge - the rows of the merged DataFrame consist of all rows where the NOC, Country, and Total columns are identical in both DataFrames
1
2
| so_merge = pd.merge(bronze, gold)
so_merge.head()
|
| NOC | Country | Total |
---|
0 | ESP | Spain | 92.0 |
---|
1 | IRL | Ireland | 8.0 |
---|
2 | SYR | Syria | 1.0 |
---|
3 | MOZ | Mozambique | 1.0 |
---|
4 | SUR | Suriname | 1.0 |
---|
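The two outputs below appear without their code cells; they look like the column and row Indexes of so_merge, so here is a minimal sketch (an assumption, run as two separate cells):
1
2
| so_merge.columns
so_merge.index
|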
1
| Index(['NOC', 'Country', 'Total'], dtype='object')
|
1
| RangeIndex(start=0, stop=18, step=1)
|
Merging on
1
2
| so_merge = pd.merge(bronze, gold, on='NOC')
so_merge.head()
|
| NOC | Country_x | Total_x | Country_y | Total_y |
---|
0 | USA | United States | 1052.0 | United States | 2088.0 |
---|
1 | URS | Soviet Union | 584.0 | Soviet Union | 838.0 |
---|
2 | GBR | United Kingdom | 505.0 | United Kingdom | 498.0 |
---|
3 | FRA | France | 475.0 | France | 378.0 |
---|
4 | GER | Germany | 454.0 | Germany | 407.0 |
---|
Merging on multiple columns
- this is where merging extends concatenation by allowing matching on multiple columns
1
2
| so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'])
so_merge.head()
|
| NOC | Country | Total_x | Total_y |
---|
0 | USA | United States | 1052.0 | 2088.0 |
---|
1 | URS | Soviet Union | 584.0 | 838.0 |
---|
2 | GBR | United Kingdom | 505.0 | 498.0 |
---|
3 | FRA | France | 475.0 | 378.0 |
---|
4 | GER | Germany | 454.0 | 407.0 |
---|
Using suffixes
1
2
| so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
so_merge.head()
|
| NOC | Country | Total_bronze | Total_gold |
---|
0 | USA | United States | 1052.0 | 2088.0 |
---|
1 | URS | Soviet Union | 584.0 | 838.0 |
---|
2 | GBR | United Kingdom | 505.0 | 498.0 |
---|
3 | FRA | France | 475.0 | 378.0 |
---|
4 | GER | Germany | 454.0 | 407.0 |
---|
Counties DataFrame
1
2
3
4
| pa_counties = {'CITY NAME': ['SALTSBURG', 'MINERAL SPRINGS', 'BIGLERVILLE', 'HANNASTOWN', 'TUNKHANNOCK'],
'COUNTY NAME': ['INDIANA', 'CLEARFIELD', 'ADAMS', 'WESTMORELAND', 'WYOMING']}
counties = pd.DataFrame.from_dict(pa_counties)
counties
|
| CITY NAME | COUNTY NAME |
---|
0 | SALTSBURG | INDIANA |
---|
1 | MINERAL SPRINGS | CLEARFIELD |
---|
2 | BIGLERVILLE | ADAMS |
---|
3 | HANNASTOWN | WESTMORELAND |
---|
4 | TUNKHANNOCK | WYOMING |
---|
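The next table shows the last five rows of the cities DataFrame built earlier; the displaying cell is not shown, so a minimal sketch (an assumption):
1
| cities.tail()
|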
| Zipcode | City | State |
---|
10 | 15681 | SALTSBURG | PA |
---|
11 | 18657 | TUNKHANNOCK | PA |
---|
12 | 15279 | PITTSBURG | PA |
---|
13 | 17231 | LEMASTERS | PA |
---|
14 | 18821 | GREAT BEND | PA |
---|
Specifying columns to merge
1
| pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
|
| CITY NAME | COUNTY NAME | Zipcode | City | State |
---|
0 | SALTSBURG | INDIANA | 15681 | SALTSBURG | PA |
---|
1 | MINERAL SPRINGS | CLEARFIELD | 16855 | MINERAL SPRINGS | PA |
---|
2 | BIGLERVILLE | ADAMS | 17307 | BIGLERVILLE | PA |
---|
3 | HANNASTOWN | WESTMORELAND | 15635 | HANNASTOWN | PA |
---|
4 | TUNKHANNOCK | WYOMING | 18657 | TUNKHANNOCK | PA |
---|
Switching left/right DataFrames
1
| pd.merge(cities, counties, left_on='City', right_on='CITY NAME')
|
| Zipcode | City | State | CITY NAME | COUNTY NAME |
---|
0 | 17307 | BIGLERVILLE | PA | BIGLERVILLE | ADAMS |
---|
1 | 16855 | MINERAL SPRINGS | PA | MINERAL SPRINGS | CLEARFIELD |
---|
2 | 15635 | HANNASTOWN | PA | HANNASTOWN | WESTMORELAND |
---|
3 | 15681 | SALTSBURG | PA | SALTSBURG | INDIANA |
---|
4 | 18657 | TUNKHANNOCK | PA | TUNKHANNOCK | WYOMING |
---|
1
| del pa_zipcode_population, pa_zipcode_city, population, cities, bronze, gold, so_merge, pa_counties, counties
|
Exercises
Merging company DataFrames
Suppose your company has operations in several different cities under several different managers. The DataFrames revenue and managers contain partial information related to the company. That is, the rows of the city columns don't quite match in revenue and managers (the Mendocino branch has no revenue yet since it just opened, and the manager of the Springfield branch recently left the company).
The DataFrames have been printed in the IPython Shell. If you were to run the command combined = pd.merge(revenue, managers, on='city')
, how many rows would combined have?
1
2
3
4
5
| rev = {'city': ['Austin', 'Denver', 'Springfield'], 'revenue': [100, 83, 4]}
man = {'city': ['Austin', 'Denver', 'Mendocino'], 'manager': ['Charles', 'Joel', 'Brett']}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
|
1
2
| combined = pd.merge(revenue, managers, on='city')
combined
|
| city | revenue | manager |
---|
0 | Austin | 100 | Charles |
---|
1 | Denver | 83 | Joel |
---|
Merging on a specific column
This exercise follows on the last one with the DataFrames revenue
and managers
for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a branch_id
column to both DataFrames. Moreover, new cities have been added to both the revenue
and managers
DataFrames as well. pandas
has been imported as pd and both DataFrames are available in your namespace.
At present, there should be a 1-to-1 relationship between the city
and branch_id
fields. In that case, the result of a merge on the city
columns ought to give you the same output as a merge on the branch_id
columns. Do they? Can you spot an ambiguity in one of the DataFrames?
Instructions
- Using
pd.merge()
, merge the DataFrames revenue
and managers
on the 'city'
column of each. Store the result as merge_by_city
. - Print the DataFrame
merge_by_city
. This has been done for you. - Merge the DataFrames
revenue
and managers
on the 'branch_id'
column of each. Store the result as merge_by_id
. - Print the DataFrame
merge_by_id
. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
| rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'], 'revenue': [100, 83, 4, 200], 'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'], 'manager': ['Charles', 'Joel', 'Brett', 'Sally'], 'branch_id': [10, 20, 47, 31]}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
|
1
2
3
4
5
| # Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')
# Print merge_by_city
merge_by_city
|
| city | revenue | branch_id_x | manager | branch_id_y |
---|
0 | Austin | 100 | 10 | Charles | 10 |
---|
1 | Denver | 83 | 20 | Joel | 20 |
---|
2 | Springfield | 4 | 30 | Sally | 31 |
---|
3 | Mendocino | 200 | 47 | Brett | 47 |
---|
1
2
3
4
5
| # Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')
# Print merge_by_id
merge_by_id
|
| city_x | revenue | branch_id | city_y | manager |
---|
0 | Austin | 100 | 10 | Austin | Charles |
---|
1 | Denver | 83 | 20 | Denver | Joel |
---|
2 | Mendocino | 200 | 47 | Mendocino | Brett |
---|
Notice that when you merge on 'city'
, the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The revenue
DataFrame has the one from Illinois, and the managers
DataFrame has the one from Missouri. Consequently, when you merge on 'branch_id'
, both of these get dropped from the merged DataFrame.
Merging on columns with non-matching labels
You continue working with the revenue
& managers
DataFrames from before. This time, someone has changed the field name 'city'
to 'branch'
in the managers
table. Now, when you attempt to merge DataFrames, an exception is thrown:
1
2
3
4
5
6
| >>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
... <text deleted> ...
pd.merge(revenue, managers, on='city')
... <text deleted> ...
KeyError: 'city'
|
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the left_on
and right_on
parameters in the call to pd.merge()
.
As before, pandas
has been pre-imported as pd
and the revenue
and managers
DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.
Are you able to merge better than in the last exercise? How should the rows with Springfield
be handled?
Instructions
- Merge the DataFrames
revenue
and managers
into a single DataFrame called combined
using the 'city'
and 'branch'
columns from the appropriate DataFrames.- In your call to
pd.merge()
, you will have to specify the parameters left_on
and right_on
appropriately.
- Print the new DataFrame
combined
.
1
2
| state_rev = {'Austin': 'TX', 'Denver': 'CO', 'Springfield': 'IL', 'Mendocino': 'CA'}
state_man = {'Austin': 'TX', 'Denver': 'CO', 'Mendocino': 'CA', 'Springfield': 'MO'}
|
1
2
| revenue['state'] = revenue['city'].map(state_rev)
managers['state'] = managers['city'].map(state_man)
|
1
| managers.rename(columns={'city': 'branch'}, inplace=True)
|
| city | revenue | branch_id | state |
---|
0 | Austin | 100 | 10 | TX |
---|
1 | Denver | 83 | 20 | CO |
---|
2 | Springfield | 4 | 30 | IL |
---|
3 | Mendocino | 200 | 47 | CA |
---|
| branch | manager | branch_id | state |
---|
0 | Austin | Charles | 10 | TX |
---|
1 | Denver | Joel | 20 | CO |
---|
2 | Mendocino | Brett | 47 | CA |
---|
3 | Springfield | Sally | 31 | MO |
---|
1
2
| combined = pd.merge(revenue, managers, left_on='city', right_on='branch')
combined
|
| city | revenue | branch_id_x | state_x | branch | manager | branch_id_y | state_y |
---|
0 | Austin | 100 | 10 | TX | Austin | Charles | 10 | TX |
---|
1 | Denver | 83 | 20 | CO | Denver | Joel | 20 | CO |
---|
2 | Springfield | 4 | 30 | IL | Springfield | Sally | 31 | MO |
---|
3 | Mendocino | 200 | 47 | CA | Mendocino | Brett | 47 | CA |
---|
Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called state
to both DataFrames from the preceding exercises. Again, pandas
has been pre-imported as pd
and the revenue
and managers
DataFrames are in your namespace.
Your goal in this exercise is to use pd.merge()
to merge DataFrames using multiple columns (using 'branch_id'
, 'city'
, and 'state'
in this case).
Are you able to match all your company’s branches correctly?
Instructions
- Create a column called
'state'
in the DataFrame revenue
, consisting of the list ['TX','CO','IL','CA']
. - Create a column called
'state'
in the DataFrame managers
, consisting of the list ['TX','CO','CA','MO']
. - Merge the DataFrames
revenue
and managers
using three columns: 'branch_id'
, 'city'
, and 'state'
. Pass them in as a list to the on
parameter of pd.merge()
.
1
| managers.rename(columns={'branch': 'city'}, inplace=True)
|
1
2
3
4
5
6
7
8
9
10
11
| # Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']
# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']
# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])
# Print combined
print(combined)
|
1
2
3
4
| city revenue branch_id state manager
0 Austin 100 10 TX Charles
1 Denver 83 20 CO Joel
2 Mendocino 200 47 CA Brett
|
1
| del rev, man, revenue, managers, merge_by_city, merge_by_id, combined
|
Joining DataFrames
- Pandas has to search through DataFrame rows for matches when computing joins and merges
- It’s useful to have different kinds of joins to mitigate costs
Medal DataFrames
1
2
| bronze = pd.read_csv(so_bronze_file)
bronze.head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 1052.0 |
---|
1 | URS | Soviet Union | 584.0 |
---|
2 | GBR | United Kingdom | 505.0 |
---|
3 | FRA | France | 475.0 |
---|
4 | GER | Germany | 454.0 |
---|
1
2
| gold = pd.read_csv(so_gold_file)
gold.head()
|
| NOC | Country | Total |
---|
0 | USA | United States | 2088.0 |
---|
1 | URS | Soviet Union | 838.0 |
---|
2 | GBR | United Kingdom | 498.0 |
---|
3 | FRA | France | 378.0 |
---|
4 | GER | Germany | 407.0 |
---|
Merging with inner join
merge()
does an inner join by default- it extracts the rows that match in joining columns from both DataFrames and it glues them together in the joined DataFrame
- the keyword argument
how='inner'
is the default behavior
1
2
| so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
so_merge.head()
|
| NOC | Country | Total_bronze | Total_gold |
---|
0 | USA | United States | 1052.0 | 2088.0 |
---|
1 | URS | Soviet Union | 584.0 | 838.0 |
---|
2 | GBR | United Kingdom | 505.0 | 498.0 |
---|
3 | FRA | France | 475.0 | 378.0 |
---|
4 | GER | Germany | 454.0 | 407.0 |
---|
Merging with left join
- using
how=left
keeps all rows of the left DataFrame in the merged DataFrame - Keeps all rows of the left DF in the merged DF
- For rows in the left DF with matches in the right DF:
- Non-joining columns of right DF are appended to left DF
- For rows in the left DF with no matches in the right DF:
- Non-joining columns are filled with nulls
1
2
| bronze = pd.read_csv(so_bronze5_file)
gold = pd.read_csv(so_gold5_file)
|
1
2
| g_noc = ['USA', 'URS', 'GBR', 'ITA', 'GER']
b_noc = ['USA', 'URS', 'GBR', 'FRA', 'GER']
|
1
2
| gold['NOC'] = g_noc
bronze['NOC'] = b_noc
|
| Country | Total | NOC |
---|
0 | United States | 2088.0 | USA |
---|
1 | Soviet Union | 838.0 | URS |
---|
2 | United Kingdom | 498.0 | GBR |
---|
3 | Italy | 460.0 | ITA |
---|
4 | Germany | 407.0 | GER |
---|
| Country | Total | NOC |
---|
0 | United States | 1052.0 | USA |
---|
1 | Soviet Union | 584.0 | URS |
---|
2 | United Kingdom | 505.0 | GBR |
---|
3 | France | 475.0 | FRA |
---|
4 | Germany | 454.0 | GER |
---|
1
| pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
|
| Country | Total_bronze | NOC | Total_gold |
---|
0 | United States | 1052.0 | USA | 2088.0 |
---|
1 | Soviet Union | 584.0 | URS | 838.0 |
---|
2 | United Kingdom | 505.0 | GBR | 498.0 |
---|
3 | France | 475.0 | FRA | NaN |
---|
4 | Germany | 454.0 | GER | 407.0 |
---|
Merging with right join
1
| pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='right')
|
| Country | Total_bronze | NOC | Total_gold |
---|
0 | United States | 1052.0 | USA | 2088.0 |
---|
1 | Soviet Union | 584.0 | URS | 838.0 |
---|
2 | United Kingdom | 505.0 | GBR | 498.0 |
---|
3 | Italy | NaN | ITA | 460.0 |
---|
4 | Germany | 454.0 | GER | 407.0 |
---|
Merging with outer join
1
| pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')
|
| Country | Total_bronze | NOC | Total_gold |
---|
0 | France | 475.0 | FRA | NaN |
---|
1 | United Kingdom | 505.0 | GBR | 498.0 |
---|
2 | Germany | 454.0 | GER | 407.0 |
---|
3 | Italy | NaN | ITA | 460.0 |
---|
4 | Soviet Union | 584.0 | URS | 838.0 |
---|
5 | United States | 1052.0 | USA | 2088.0 |
---|
Population & unemployment data
1
2
3
4
| population = pd.DataFrame.from_dict({'Zip Code ZCTA': [57538, 59916, 37660, 2860],
'2010 Census Population': [322, 130, 40038, 45199]})
population.set_index('Zip Code ZCTA', inplace=True)
population
|
| 2010 Census Population |
---|
Zip Code ZCTA | |
---|
57538 | 322 |
---|
59916 | 130 |
---|
37660 | 40038 |
---|
2860 | 45199 |
---|
1
2
3
4
5
| unemployment = pd.DataFrame.from_dict({'Zip': [2860, 46167, 1097],
'unemployment': [0.11, 0.02, 0.33],
'participants': [ 34447, 4800, 32]})
unemployment.set_index('Zip', inplace=True)
unemployment
|
| unemployment | participants |
---|
Zip | | |
---|
2860 | 0.11 | 34447 |
---|
46167 | 0.02 | 4800 |
---|
1097 | 0.33 | 32 |
---|
Using .join(how=’left’)
- computes a left join using the Index by default
1
| population.join(unemployment)
|
| 2010 Census Population | unemployment | participants |
---|
Zip Code ZCTA | | | |
---|
57538 | 322 | NaN | NaN |
---|
59916 | 130 | NaN | NaN |
---|
37660 | 40038 | NaN | NaN |
---|
2860 | 45199 | 0.11 | 34447.0 |
---|
Using .join(how=’right’)
1
| population.join(unemployment, how='right')
|
| 2010 Census Population | unemployment | participants |
---|
Zip | | | |
---|
2860 | 45199.0 | 0.11 | 34447 |
---|
46167 | NaN | 0.02 | 4800 |
---|
1097 | NaN | 0.33 | 32 |
---|
Using .join(how=’inner’)
1
| population.join(unemployment, how='inner')
|
| 2010 Census Population | unemployment | participants |
---|
2860 | 45199 | 0.11 | 34447 |
---|
Using .join(how=’outer’)
1
| population.join(unemployment, how='outer')
|
| 2010 Census Population | unemployment | participants |
---|
1097 | NaN | 0.33 | 32.0 |
---|
2860 | 45199.0 | 0.11 | 34447.0 |
---|
37660 | 40038.0 | NaN | NaN |
---|
46167 | NaN | 0.02 | 4800.0 |
---|
57538 | 322.0 | NaN | NaN |
---|
59916 | 130.0 | NaN | NaN |
---|
1
| del bronze, gold, so_merge, g_noc, b_noc, population, unemployment
|
Which should you use?
- df1.append(df2): stacking vertically (removed in pandas 2.0; use pd.concat() instead)
- pd.concat([df1, df2]):
- stacking many horizontally or vertically
- simple inner/outer joins on Indexes
- df1.join(df2): inner/outer/left/right joins on Indexes
- pd.merge(df1, df2): many joins on multiple columns
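A minimal side-by-side sketch of these options on two hypothetical DataFrames, df1 and df2 (.append() is omitted because it was removed in pandas 2.0):
1
2
3
4
5
6
| df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])
pd.concat([df1, df2], axis=0)                # stack vertically; index becomes a, b, b, c
pd.concat([df1, df2], axis=1, join='inner')  # align on the Index; only label 'b' survives
df1.join(df2, how='outer')                   # Index-based join with inner/outer/left/right
pd.merge(df1, df2, left_index=True, right_index=True, how='left')  # joins on columns or Indexes
|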
Exercises
Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
'state': ['TX','CO','IL','CA'],
'revenue': [100, 83, 4, 200],
'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
'state': ['TX','CO','CA','MO'],
'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
'branch_id': [10, 20, 47, 31]}
revenue = pd.DataFrame.from_dict(rev)
revenue.set_index('branch_id', inplace=True)
managers = pd.DataFrame.from_dict(man)
managers.set_index('branch_id', inplace=True)
|
| city | state | revenue |
---|
branch_id | | | |
---|
10 | Austin | TX | 100 |
---|
20 | Denver | CO | 83 |
---|
30 | Springfield | IL | 4 |
---|
47 | Mendocino | CA | 200 |
---|
| city | state | manager |
---|
branch_id | | | |
---|
10 | Austin | TX | Charles |
---|
20 | Denver | CO | Joel |
---|
47 | Mendocino | CA | Brett |
---|
31 | Springfield | MO | Sally |
---|
Joining by Index
The DataFrames revenue
and managers
are displayed in the IPython Shell. Here, they are indexed by 'branch_id'
.
Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels [10, 20, 30, 31, 47]
. Explore each of them in the IPython Shell to get a better understanding of their functionality.
1
| revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')
|
| city_rev | state_rev | revenue | city_mng | state_mng | manager |
---|
branch_id | | | | | | |
---|
10 | Austin | TX | 100.0 | Austin | TX | Charles |
---|
20 | Denver | CO | 83.0 | Denver | CO | Joel |
---|
30 | Springfield | IL | 4.0 | NaN | NaN | NaN |
---|
31 | NaN | NaN | NaN | Springfield | MO | Sally |
---|
47 | Mendocino | CA | 200.0 | Mendocino | CA | Brett |
---|
Choosing a joining strategy
Suppose you have two DataFrames: students
(with columns 'StudentID'
, 'LastName'
, 'FirstName'
, and 'Major'
) and midterm_results
(with columns 'StudentID'
, 'Q1'
, 'Q2'
, and 'Q3'
for their scores on midterm questions).
You want to combine the DataFrames into a single DataFrame grades
, and be able to easily spot which students wrote the midterm and which didn’t (their midterm question scores 'Q1'
, 'Q2'
, & 'Q3'
should be filled with NaN
values).
You also want to drop rows from midterm_results
in which the StudentID
is not found in students
.
Which of the following strategies gives the desired result?
1
2
| students = pd.DataFrame.from_dict({'StudentID': [], 'LastName': [], 'FirstName': [], 'Major': []})
midterm_results = pd.DataFrame.from_dict({'StudentID': [], 'Q1': [], 'Q2': [], 'Q3': []})
|
| StudentID | LastName | FirstName | Major |
---|
1
| grades = pd.merge(students, midterm_results, how='left')
|
Left & right merging on multiple columns
You now have, in addition to the revenue
and managers
DataFrames from prior exercises, a DataFrame sales
that summarizes units sold from specific branches (identified by city
and state
but not branch_id
).
Once again, the managers
DataFrame uses the label branch
in place of city
as in the other two DataFrames. Your task here is to employ left and right merges to preserve data and identify where data is missing.
By merging revenue
and sales
with a right merge, you can identify the missing revenue
values. Here, you don’t need to specify left_on
or right_on
because the columns to merge on have matching labels.
By merging sales
and managers
with a left merge, you can identify the missing manager
. Here, the columns to merge on have conflicting labels, so you must specify left_on
and right_on
. In both cases, you’re looking to figure out how to connect the fields in rows containing Springfield
.
pandas
has been imported as pd
and the three DataFrames revenue
, managers
, and sales
have been pre-loaded. They have been printed for you to explore in the IPython Shell.
Instructions
- Execute a right merge using
pd.merge()
with revenue
and sales
to yield a new DataFrame revenue_and_sales
.- Use
how='right'
and on=['city', 'state']
.
- Print the new DataFrame
revenue_and_sales
. This has been done for you. - Execute a left merge with
sales
and managers
to yield a new DataFrame sales_and_managers
.- Use
how='left'
, left_on=['city', 'state']
, and right_on=['branch', 'state']
.
- Print the new DataFrame
sales_and_managers
. This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
'branch_id': [10, 20, 30, 47],
'state': ['TX','CO','IL','CA'],
'revenue': [100, 83, 4, 200]}
man = {'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
'branch_id': [10, 20, 47, 31],
'state': ['TX','CO','CA','MO'],
'manager': ['Charles', 'Joel', 'Brett', 'Sally']}
sale = {'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
'units': [1, 4, 2, 5, 1]}
revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
sales = pd.DataFrame.from_dict(sale)
|
| city | branch_id | state | revenue |
---|
0 | Austin | 10 | TX | 100 |
---|
1 | Denver | 20 | CO | 83 |
---|
2 | Springfield | 30 | IL | 4 |
---|
3 | Mendocino | 47 | CA | 200 |
---|
| branch | branch_id | state | manager |
---|
0 | Austin | 10 | TX | Charles |
---|
1 | Denver | 20 | CO | Joel |
---|
2 | Mendocino | 47 | CA | Brett |
---|
3 | Springfield | 31 | MO | Sally |
---|
| city | state | units |
---|
0 | Mendocino | CA | 1 |
---|
1 | Denver | CO | 4 |
---|
2 | Austin | TX | 2 |
---|
3 | Springfield | MO | 5 |
---|
4 | Springfield | IL | 1 |
---|
1
2
3
4
5
| # Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
# Print revenue_and_sales
revenue_and_sales
|
| city | branch_id | state | revenue | units |
---|
0 | Mendocino | 47.0 | CA | 200.0 | 1 |
---|
1 | Denver | 20.0 | CO | 83.0 | 4 |
---|
2 | Austin | 10.0 | TX | 100.0 | 2 |
---|
3 | Springfield | NaN | MO | NaN | 5 |
---|
4 | Springfield | 30.0 | IL | 4.0 | 1 |
---|
1
2
3
4
| sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])
# Print sales_and_managers
sales_and_managers
|
| city | state | units | branch | branch_id | manager |
---|
0 | Mendocino | CA | 1 | Mendocino | 47.0 | Brett |
---|
1 | Denver | CO | 4 | Denver | 20.0 | Joel |
---|
2 | Austin | TX | 2 | Austin | 10.0 | Charles |
---|
3 | Springfield | MO | 5 | Springfield | 31.0 | Sally |
---|
4 | Springfield | IL | 1 | NaN | NaN | NaN |
---|
Merging DataFrames with outer join
This exercise picks up where the previous one left off. The DataFrames revenue
, managers
, and sales
are pre-loaded into your namespace (and, of course, pandas
is imported as pd
). Moreover, the merged DataFrames revenue_and_sales
and sales_and_managers
have been pre-computed exactly as you did in the previous exercise.
The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with restricted subset of columns as keys.
Instructions
- Merge
sales_and_managers
with revenue_and_sales
. Store the result as merge_default
. - Print
merge_default
. This has been done for you. - Merge
sales_and_managers
with revenue_and_sales
using how='outer'
. Store the result as merge_outer
. - Print
merge_outer
. This has been done for you. - Merge
sales_and_managers
with revenue_and_sales
only on ['city','state']
using an outer join. Store the result as merge_outer_on
and hit ‘Submit Answer’ to see what the merged DataFrames look like!
1
2
3
4
5
| # Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)
# Print merge_default
merge_default
|
| city | state | units | branch | branch_id | manager | revenue |
---|
0 | Mendocino | CA | 1 | Mendocino | 47.0 | Brett | 200.0 |
---|
1 | Denver | CO | 4 | Denver | 20.0 | Joel | 83.0 |
---|
2 | Austin | TX | 2 | Austin | 10.0 | Charles | 100.0 |
---|
1
2
3
4
5
| # Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')
# Print merge_outer
merge_outer
|
| city | state | units | branch | branch_id | manager | revenue |
---|
0 | Austin | TX | 2 | Austin | 10.0 | Charles | 100.0 |
---|
1 | Denver | CO | 4 | Denver | 20.0 | Joel | 83.0 |
---|
2 | Mendocino | CA | 1 | Mendocino | 47.0 | Brett | 200.0 |
---|
3 | Springfield | IL | 1 | NaN | 30.0 | NaN | 4.0 |
---|
4 | Springfield | IL | 1 | NaN | NaN | NaN | NaN |
---|
5 | Springfield | MO | 5 | Springfield | 31.0 | Sally | NaN |
---|
6 | Springfield | MO | 5 | NaN | NaN | NaN | NaN |
---|
1
2
3
4
5
| # Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city', 'state'], how='outer')
# Print merge_outer_on
merge_outer_on
|
| city | state | units_x | branch | branch_id_x | manager | branch_id_y | revenue | units_y |
---|
0 | Austin | TX | 2 | Austin | 10.0 | Charles | 10.0 | 100.0 | 2 |
---|
1 | Denver | CO | 4 | Denver | 20.0 | Joel | 20.0 | 83.0 | 4 |
---|
2 | Mendocino | CA | 1 | Mendocino | 47.0 | Brett | 47.0 | 200.0 | 1 |
---|
3 | Springfield | IL | 1 | NaN | NaN | NaN | 30.0 | 4.0 | 1 |
---|
4 | Springfield | MO | 5 | Springfield | 31.0 | Sally | NaN | NaN | 5 |
---|
1
| del rev, man, revenue, managers, students, midterm_results, grades, sale, sales, revenue_and_sales, sales_and_managers, merge_default, merge_outer, merge_outer_on
|
Ordered merges
Software & hardware sales
1
2
| software = pd.read_csv(sales_feb_software_file, parse_dates=['Date']).sort_values('Date')
software.head(10)
|
| Date | Company | Product | Units |
---|
2 | 2015-02-02 08:33:01 | Hooli | Software | 3 |
---|
1 | 2015-02-03 14:14:18 | Initech | Software | 13 |
---|
7 | 2015-02-04 15:36:29 | Streeplex | Software | 13 |
---|
3 | 2015-02-05 01:53:06 | Acme Coporation | Software | 19 |
---|
5 | 2015-02-09 13:09:55 | Mediacore | Software | 7 |
---|
4 | 2015-02-11 20:03:08 | Initech | Software | 7 |
---|
6 | 2015-02-11 22:50:44 | Hooli | Software | 4 |
---|
0 | 2015-02-16 12:09:19 | Hooli | Software | 10 |
---|
8 | 2015-02-21 05:01:26 | Mediacore | Software | 3 |
---|
1
2
| hardware = pd.read_csv(sales_feb_hardware_file, parse_dates=['Date']).sort_values('Date')
hardware.head()
|
| Date | Company | Product | Units |
---|
3 | 2015-02-02 20:54:49 | Mediacore | Hardware | 9 |
---|
0 | 2015-02-04 21:52:45 | Acme Coporation | Hardware | 14 |
---|
1 | 2015-02-07 22:58:10 | Acme Coporation | Hardware | 1 |
---|
2 | 2015-02-19 10:59:33 | Mediacore | Hardware | 16 |
---|
4 | 2015-02-21 20:41:47 | Hooli | Hardware | 3 |
---|
Using merge()
- attempting to merge yields an empty DataFrame because merge() performs an INNER join on all columns with matching names by default
- ‘Units’ and ‘Date’ columns have no overlapping values, so the result is empty
1
2
| sales_merge = pd.merge(hardware, software)
sales_merge
|
1
2
3
4
5
6
7
8
9
10
11
| <class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 0 non-null datetime64[ns]
1 Company 0 non-null object
2 Product 0 non-null object
3 Units 0 non-null int64
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 132.0+ bytes
|
Using merge(how='outer')
1
2
| sales_merge = pd.merge(hardware, software, how='outer')
sales_merge.head(14)
|
| Date | Company | Product | Units |
---|
0 | 2015-02-02 08:33:01 | Hooli | Software | 3 |
---|
1 | 2015-02-02 20:54:49 | Mediacore | Hardware | 9 |
---|
2 | 2015-02-03 14:14:18 | Initech | Software | 13 |
---|
3 | 2015-02-04 15:36:29 | Streeplex | Software | 13 |
---|
4 | 2015-02-04 21:52:45 | Acme Coporation | Hardware | 14 |
---|
5 | 2015-02-05 01:53:06 | Acme Coporation | Software | 19 |
---|
6 | 2015-02-07 22:58:10 | Acme Coporation | Hardware | 1 |
---|
7 | 2015-02-09 13:09:55 | Mediacore | Software | 7 |
---|
8 | 2015-02-11 20:03:08 | Initech | Software | 7 |
---|
9 | 2015-02-11 22:50:44 | Hooli | Software | 4 |
---|
10 | 2015-02-16 12:09:19 | Hooli | Software | 10 |
---|
11 | 2015-02-19 10:59:33 | Mediacore | Hardware | 16 |
---|
12 | 2015-02-21 05:01:26 | Mediacore | Software | 3 |
---|
13 | 2015-02-21 20:41:47 | Hooli | Hardware | 3 |
---|
Sorting merge(how='outer')
1
2
| sales_merge = pd.merge(hardware, software, how='outer').sort_values('Date')
sales_merge.head(14)
|
| Date | Company | Product | Units |
---|
0 | 2015-02-02 08:33:01 | Hooli | Software | 3 |
---|
1 | 2015-02-02 20:54:49 | Mediacore | Hardware | 9 |
---|
2 | 2015-02-03 14:14:18 | Initech | Software | 13 |
---|
3 | 2015-02-04 15:36:29 | Streeplex | Software | 13 |
---|
4 | 2015-02-04 21:52:45 | Acme Coporation | Hardware | 14 |
---|
5 | 2015-02-05 01:53:06 | Acme Coporation | Software | 19 |
---|
6 | 2015-02-07 22:58:10 | Acme Coporation | Hardware | 1 |
---|
7 | 2015-02-09 13:09:55 | Mediacore | Software | 7 |
---|
8 | 2015-02-11 20:03:08 | Initech | Software | 7 |
---|
9 | 2015-02-11 22:50:44 | Hooli | Software | 4 |
---|
10 | 2015-02-16 12:09:19 | Hooli | Software | 10 |
---|
11 | 2015-02-19 10:59:33 | Mediacore | Hardware | 16 |
---|
12 | 2015-02-21 05:01:26 | Mediacore | Software | 3 |
---|
13 | 2015-02-21 20:41:47 | Hooli | Hardware | 3 |
---|
Using merge_ordered()
- the default is an OUTER join
1
2
| sales_merged = pd.merge_ordered(hardware, software)
sales_merged.head(14)
|
| Date | Company | Product | Units |
---|
0 | 2015-02-02 08:33:01 | Hooli | Software | 3 |
---|
1 | 2015-02-02 20:54:49 | Mediacore | Hardware | 9 |
---|
2 | 2015-02-03 14:14:18 | Initech | Software | 13 |
---|
3 | 2015-02-04 15:36:29 | Streeplex | Software | 13 |
---|
4 | 2015-02-04 21:52:45 | Acme Coporation | Hardware | 14 |
---|
5 | 2015-02-05 01:53:06 | Acme Coporation | Software | 19 |
---|
6 | 2015-02-07 22:58:10 | Acme Coporation | Hardware | 1 |
---|
7 | 2015-02-09 13:09:55 | Mediacore | Software | 7 |
---|
8 | 2015-02-11 20:03:08 | Initech | Software | 7 |
---|
9 | 2015-02-11 22:50:44 | Hooli | Software | 4 |
---|
10 | 2015-02-16 12:09:19 | Hooli | Software | 10 |
---|
11 | 2015-02-19 10:59:33 | Mediacore | Hardware | 16 |
---|
12 | 2015-02-21 05:01:26 | Mediacore | Software | 3 |
---|
13 | 2015-02-21 20:41:47 | Hooli | Hardware | 3 |
---|
Using on & suffixes
1
2
| sales_merged = pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
sales_merged.head()
|
| Date | Company | Product_hardware | Units_hardware | Product_software | Units_software |
---|
0 | 2015-02-02 08:33:01 | Hooli | NaN | NaN | Software | 3.0 |
---|
1 | 2015-02-02 20:54:49 | Mediacore | Hardware | 9.0 | NaN | NaN |
---|
2 | 2015-02-03 14:14:18 | Initech | NaN | NaN | Software | 13.0 |
---|
3 | 2015-02-04 15:36:29 | Streeplex | NaN | NaN | Software | 13.0 |
---|
4 | 2015-02-04 21:52:45 | Acme Coporation | Hardware | 14.0 | NaN | NaN |
---|
Stocks data
1
2
3
4
5
6
7
8
9
10
| stocks_dir = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'
tickers = ['^gspc', 'AAPL', 'CSCO', 'AMZN', 'MSFT', 'IBM']
for tk in tickers:
    print(tk)
    df = yf.download(tk, start='1980-01-01', end='2024-04-30', interval='1d').assign(tkr=tk)
    if tk == '^gspc':
        tk = 'SP500'
    df.to_csv(stocks_dir / f'{tk}.csv', index=True)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
| ^gspc
[*********************100%%**********************] 1 of 1 completed
AAPL
[*********************100%%**********************] 1 of 1 completed
CSCO
[*********************100%%**********************] 1 of 1 completed
[*********************100%%**********************] 1 of 1 completed
AMZN
MSFT
[*********************100%%**********************] 1 of 1 completed
IBM
[*********************100%%**********************] 1 of 1 completed
|
1
2
3
4
5
6
| sp500_stocks = stocks_dir / 'SP500.csv'
aapl = stocks_dir / 'AAPL.csv'
csco = stocks_dir / 'CSCO.csv'
amzn = stocks_dir / 'AMZN.csv'
msft = stocks_dir / 'MSFT.csv'
ibm = stocks_dir / 'IBM.csv'
|
1
2
3
4
5
6
| sp500_df = pd.read_csv(sp500_stocks, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
aapl_df = pd.read_csv(aapl, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
csco_df = pd.read_csv(csco, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
amzn_df = pd.read_csv(amzn, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
msft_df = pd.read_csv(msft, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
ibm_df = pd.read_csv(ibm, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
|
1
2
3
4
5
6
| sp500_df.rename(columns={'Close': 'S&P'}, inplace=True)
aapl_df.rename(columns={'Close': 'AAPL'}, inplace=True)
csco_df.rename(columns={'Close': 'CSCO'}, inplace=True)
amzn_df.rename(columns={'Close': 'AMZN'}, inplace=True)
msft_df.rename(columns={'Close': 'MSFT'}, inplace=True)
ibm_df.rename(columns={'Close': 'IBM'}, inplace=True)
|
| S&P |
---|
Date | |
---|
1980-01-02 | 105.760002 |
---|
1980-01-03 | 105.220001 |
---|
1980-01-04 | 106.519997 |
---|
1980-01-07 | 106.809998 |
---|
1980-01-08 | 108.949997 |
---|
... | ... |
---|
2024-04-23 | 5070.549805 |
---|
2024-04-24 | 5071.629883 |
---|
2024-04-25 | 5048.419922 |
---|
2024-04-26 | 5099.959961 |
---|
2024-04-29 | 5116.169922 |
---|
11175 rows × 1 columns
1
| stocks = pd.concat([sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df], axis=1)
|
| S&P | AAPL | CSCO | AMZN | MSFT | IBM |
---|
Date | | | | | | |
---|
1980-01-02 | 105.760002 | NaN | NaN | NaN | NaN | 14.937859 |
---|
1980-01-03 | 105.220001 | NaN | NaN | NaN | NaN | 15.176864 |
---|
1980-01-04 | 106.519997 | NaN | NaN | NaN | NaN | 15.146989 |
---|
1980-01-07 | 106.809998 | NaN | NaN | NaN | NaN | 15.087237 |
---|
1980-01-08 | 108.949997 | NaN | NaN | NaN | NaN | 16.103010 |
---|
| S&P | AAPL | CSCO | AMZN | MSFT | IBM |
---|
Date | | | | | | |
---|
2024-04-23 | 5070.549805 | 166.899994 | 48.320000 | 179.539993 | 407.570007 | 182.190002 |
---|
2024-04-24 | 5071.629883 | 169.020004 | 48.349998 | 176.589996 | 409.059998 | 184.100006 |
---|
2024-04-25 | 5048.419922 | 169.889999 | 48.099998 | 173.669998 | 399.040009 | 168.910004 |
---|
2024-04-26 | 5099.959961 | 169.300003 | 47.860001 | 179.619995 | 406.320007 | 167.130005 |
---|
2024-04-29 | 5116.169922 | 173.500000 | 47.779999 | 180.960007 | 402.250000 | 167.429993 |
---|
1
| stocks.to_csv(stocks_dir / 'stocks.csv', index=True, index_label='Date')
|
GDP data
1
2
3
4
5
| gdp = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
gdp.sort_values(by=['DATE'], ascending=False, inplace=True)
gdp.reset_index(inplace=True, drop=True)
gdp.rename(columns={'VALUE': 'GDP', 'DATE': 'Date'}, inplace=True)
gdp.head(8)
|
| Date | GDP |
---|
0 | 2016-04-01 | 18436.5 |
---|
1 | 2016-01-01 | 18281.6 |
---|
2 | 2015-10-01 | 18222.8 |
---|
3 | 2015-07-01 | 18141.9 |
---|
4 | 2015-04-01 | 17998.3 |
---|
5 | 2015-01-01 | 17783.6 |
---|
6 | 2014-10-01 | 17692.2 |
---|
7 | 2014-07-01 | 17569.4 |
---|
Ordered merge
1
| gdp_2000_2015 = gdp[(gdp['Date'].dt.year >= 2000) & (gdp['Date'].dt.year <= 2015)]
|
1
2
| stocks.reset_index(inplace=True)
stocks.head(5)
|
| Date | S&P | AAPL | CSCO | AMZN | MSFT | IBM |
---|
0 | 1980-01-02 | 105.760002 | NaN | NaN | NaN | NaN | 14.937859 |
---|
1 | 1980-01-03 | 105.220001 | NaN | NaN | NaN | NaN | 15.176864 |
---|
2 | 1980-01-04 | 106.519997 | NaN | NaN | NaN | NaN | 15.146989 |
---|
3 | 1980-01-07 | 106.809998 | NaN | NaN | NaN | NaN | 15.087237 |
---|
4 | 1980-01-08 | 108.949997 | NaN | NaN | NaN | NaN | 16.103010 |
---|
1
| stocks_2000_2015 = stocks[(stocks['Date'].dt.year >= 2000) & (stocks['Date'].dt.year <= 2015)]
|
1
2
| ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date')
ordered_df.head()
|
| Date | S&P | AAPL | CSCO | AMZN | MSFT | IBM | GDP |
---|
0 | 2000-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | 10031.0 |
---|
1 | 2000-01-03 | 1455.219971 | 0.999442 | 54.03125 | 4.468750 | 58.28125 | 110.898659 | NaN |
---|
2 | 2000-01-04 | 1399.420044 | 0.915179 | 51.00000 | 4.096875 | 56.31250 | 107.134323 | NaN |
---|
3 | 2000-01-05 | 1402.109985 | 0.928571 | 50.84375 | 3.487500 | 56.90625 | 110.898659 | NaN |
---|
4 | 2000-01-06 | 1403.449951 | 0.848214 | 50.00000 | 3.278125 | 55.00000 | 108.986618 | NaN |
---|
Ordered merge with ffill
1
2
| ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date', fill_method='ffill')
ordered_df.head()
|
| Date | S&P | AAPL | CSCO | AMZN | MSFT | IBM | GDP |
---|
0 | 2000-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | 10031.0 |
---|
1 | 2000-01-03 | 1455.219971 | 0.999442 | 54.03125 | 4.468750 | 58.28125 | 110.898659 | 10031.0 |
---|
2 | 2000-01-04 | 1399.420044 | 0.915179 | 51.00000 | 4.096875 | 56.31250 | 107.134323 | 10031.0 |
---|
3 | 2000-01-05 | 1402.109985 | 0.928571 | 50.84375 | 3.487500 | 56.90625 | 110.898659 | 10031.0 |
---|
4 | 2000-01-06 | 1403.449951 | 0.848214 | 50.00000 | 3.278125 | 55.00000 | 108.986618 | 10031.0 |
---|
1
2
3
| del software, hardware, sales_merge, sales_merged, stocks_dir, sp500_stocks, aapl
del csco, amzn, msft, ibm, sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df, stocks
del gdp, gdp_2000_2015, stocks_2000_2015, ordered_df
|
Exercises
Using merge_ordered()
This exercise uses pre-loaded DataFrames austin
and houston
that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.
Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you’ll use pd.merge_ordered()
. After you’re done, note the order of the rows before and after merging.
Instructions
- Perform an ordered merge on
austin
and houston
using pd.merge_ordered()
. Store the result as tx_weather
. - Print
tx_weather
. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city. - Perform another ordered merge on
austin
and houston
.- This time, specify the keyword arguments
on='date'
and suffixes=['_aus','_hus']
so that the rows can be distinguished. Store the result as tx_weather_suff
.
- Print
tx_weather_suff
to examine its contents. This has been done for you. - Perform a third ordered merge on
austin
and houston
.- This time, in addition to the
on
and suffixes
parameters, specify the keyword argument fill_method='ffill'
to use forward-filling to replace NaN
entries with the most recent non-null entry, and hit ‘Submit Answer’ to examine the contents of the merged DataFrames!
1
2
| austin = pd.DataFrame.from_dict({'date': ['2016-01-01', '2016-02-08', '2016-01-17'], 'ratings': ['Cloudy', 'Cloudy', 'Sunny']})
houston = pd.DataFrame.from_dict({'date': ['2016-01-04', '2016-01-01', '2016-03-01'], 'ratings': ['Rainy', 'Cloudy', 'Sunny']})
|
1
2
3
4
5
| # Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)
# Print tx_weather
tx_weather
|
| date | ratings |
---|
0 | 2016-01-01 | Cloudy |
---|
1 | 2016-01-04 | Rainy |
---|
2 | 2016-01-17 | Sunny |
---|
3 | 2016-02-08 | Cloudy |
---|
4 | 2016-03-01 | Sunny |
---|
1
2
3
4
5
| # Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'])
# Print tx_weather_suff
tx_weather_suff
|
| date | ratings_aus | ratings_hus |
---|
0 | 2016-01-01 | Cloudy | Cloudy |
---|
1 | 2016-01-04 | NaN | Rainy |
---|
2 | 2016-01-17 | Sunny | NaN |
---|
3 | 2016-02-08 | Cloudy | NaN |
---|
4 | 2016-03-01 | NaN | Sunny |
---|
1
2
3
4
5
| # Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'], fill_method='ffill')
# Print tx_weather_ffill
tx_weather_ffill
|
| date | ratings_aus | ratings_hus |
---|
0 | 2016-01-01 | Cloudy | Cloudy |
---|
1 | 2016-01-04 | Cloudy | Rainy |
---|
2 | 2016-01-17 | Sunny | Rainy |
---|
3 | 2016-02-08 | Cloudy | Rainy |
---|
4 | 2016-03-01 | Cloudy | Sunny |
---|
1
| del austin, houston, tx_weather, tx_weather_suff, tx_weather_ffill
|
Using merge_asof()
Similar to pd.merge_ordered()
, the pd.merge_asof()
function will also merge values in order using the on
column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on'
column values are less than or equal to the left value are considered, and only the closest (most recent) such row is kept.
This function can be used to align disparate datetime frequencies without having to first resample.
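A minimal sketch of that matching rule, using made-up toy data (not the pre-loaded oil and auto DataFrames):

```python
import pandas as pd

# Both frames must be sorted on the key column, which merge_asof requires
events = pd.DataFrame({'time': pd.to_datetime(['2016-01-05', '2016-01-20']),
                       'event': ['a', 'b']})
prices = pd.DataFrame({'time': pd.to_datetime(['2016-01-01', '2016-01-15', '2016-02-01']),
                       'price': [10.0, 11.0, 12.0]})

# Each event row picks up the most recent price at or before its own time:
# 2016-01-05 -> 10.0 (from 2016-01-01), 2016-01-20 -> 11.0 (from 2016-01-15)
print(pd.merge_asof(events, prices, on='time'))
```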
Here, you’ll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil
and auto
. The first 5 rows of each have been printed in the IPython Shell for you to explore.
These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.
You’ll then inspect the merged DataFrame, resample by year and compute the mean 'Price'
and 'mpg'
. You should be able to see a trend in these two columns, that you can confirm by computing the Pearson correlation between resampled 'Price'
and 'mpg'
.
Instructions
- Merge
auto
and oil
using pd.merge_asof()
with left_on='yr'
and right_on='Date'
. Store the result as merged. - Print the tail of
merged
. This has been done for you. - Resample
merged
using 'A'
(annual frequency; in pandas 2.2 the equivalent alias is 'YE'), and on='Date'
. Select [['mpg','Price']]
and aggregate the mean. Store the result as yearly
. - Hit Submit Answer to examine the contents of
yearly
and yearly.corr()
, which shows the Pearson correlation between the resampled 'Price'
and 'mpg'
.
1
2
| oil = pd.read_csv(oil_price_file, parse_dates=['Date'])
auto = pd.read_csv(auto_fuel_file, parse_dates=['yr'])
|
| Date | Price |
---|
0 | 1970-01-01 | 3.35 |
---|
1 | 1970-02-01 | 3.35 |
---|
2 | 1970-03-01 | 3.35 |
---|
3 | 1970-04-01 | 3.35 |
---|
4 | 1970-05-01 | 3.35 |
---|
| mpg | cyl | displ | hp | weight | accel | yr | origin | name |
---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 1970-01-01 | US | chevrolet chevelle malibu |
---|
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 1970-01-01 | US | buick skylark 320 |
---|
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 1970-01-01 | US | plymouth satellite |
---|
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 1970-01-01 | US | amc rebel sst |
---|
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 1970-01-01 | US | ford torino |
---|
1
2
3
4
5
| # Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')
# Print the tail of merged
merged.tail()
|
| mpg | cyl | displ | hp | weight | accel | yr | origin | name | Date | Price |
---|
387 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 1982-01-01 | US | ford mustang gl | 1982-01-01 | 33.85 |
---|
388 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 1982-01-01 | Europe | vw pickup | 1982-01-01 | 33.85 |
---|
389 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 1982-01-01 | US | dodge rampage | 1982-01-01 | 33.85 |
---|
390 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 1982-01-01 | US | ford ranger | 1982-01-01 | 33.85 |
---|
391 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 1982-01-01 | US | chevy s-10 | 1982-01-01 | 33.85 |
---|
1
2
3
4
5
| # Resample merged: yearly
yearly = merged.resample('YE', on='Date')[['mpg','Price']].mean()
# Print yearly
yearly
|
| mpg | Price |
---|
Date | | |
---|
1970-12-31 | 17.689655 | 3.35 |
---|
1971-12-31 | 21.111111 | 3.56 |
---|
1972-12-31 | 18.714286 | 3.56 |
---|
1973-12-31 | 17.100000 | 3.56 |
---|
1974-12-31 | 22.769231 | 10.11 |
---|
1975-12-31 | 20.266667 | 11.16 |
---|
1976-12-31 | 21.573529 | 11.16 |
---|
1977-12-31 | 23.375000 | 13.90 |
---|
1978-12-31 | 24.061111 | 14.85 |
---|
1979-12-31 | 25.093103 | 14.85 |
---|
1980-12-31 | 33.803704 | 32.50 |
---|
1981-12-31 | 30.185714 | 38.00 |
---|
1982-12-31 | 32.000000 | 33.85 |
---|
1
2
| # print yearly.corr()
yearly.corr()
|
| mpg | Price |
---|
mpg | 1.000000 | 0.948677 |
---|
Price | 0.948677 | 1.000000 |
---|
Case Study - Summer Olympics
To cement your new skills, you’ll apply them by working on an in-depth study involving Olympic medal data. The analysis involves integrating your multi-DataFrame skills from this course and also skills you’ve gained in previous pandas courses. This is a rich dataset that will allow you to fully leverage your pandas data manipulation skills. Enjoy!
Medals in the Summer Olympics
Summer Olympic medalists 1896 to 2008 - IOC COUNTRY CODES.csv
1
| pd.read_csv(so_ioc_codes_file).head(8)
|
| Country | NOC | ISO code |
---|
0 | Afghanistan | AFG | AF |
---|
1 | Albania | ALB | AL |
---|
2 | Algeria | ALG | DZ |
---|
3 | American Samoa* | ASA | AS |
---|
4 | Andorra | AND | AD |
---|
5 | Angola | ANG | AO |
---|
6 | Antigua and Barbuda | ANT | AG |
---|
7 | Argentina | ARG | AR |
---|
Summer Olympic medalists 1896 to 2008 - EDITIONS.tsv
1
| pd.read_csv(so_editions_file, sep='\t').head(8)
|
| Edition | Bronze | Gold | Silver | Grand Total | City | Country |
---|
0 | 1896 | 40 | 64 | 47 | 151 | Athens | Greece |
---|
1 | 1900 | 142 | 178 | 192 | 512 | Paris | France |
---|
2 | 1904 | 123 | 188 | 159 | 470 | St. Louis | United States |
---|
3 | 1908 | 211 | 311 | 282 | 804 | London | United Kingdom |
---|
4 | 1912 | 284 | 301 | 300 | 885 | Stockholm | Sweden |
---|
5 | 1920 | 355 | 497 | 446 | 1298 | Antwerp | Belgium |
---|
6 | 1924 | 285 | 301 | 298 | 884 | Paris | France |
---|
7 | 1928 | 242 | 229 | 239 | 710 | Amsterdam | Netherlands |
---|
summer_1896.csv, summer_1900.csv, …, summer_2008.csv
1
| pd.read_csv(so_all_medalists_file, sep='\t', header=4).head(8)
|
| City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
---|
0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
---|
1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
---|
2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
---|
3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
---|
4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
---|
5 | Athens | 1896 | Aquatics | Swimming | CHOROPHAS, Efstathios | GRE | Men | 1200m freestyle | M | Bronze |
---|
6 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 1200m freestyle | M | Gold |
---|
7 | Athens | 1896 | Aquatics | Swimming | ANDREOU, Joannis | GRE | Men | 1200m freestyle | M | Silver |
---|
Reminder: loading & merging files
- pd.read_csv() (& its many options)
- Looping over files, e.g.,
- [pd.read_csv(f) for f in glob('*.csv')]
- Concatenating & appending, e.g.,
- pd.concat([df1, df2], axis=0)
- df1.append(df2) (DataFrame.append() was removed in pandas 2.x; use pd.concat() instead, as in the sketch below)
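A minimal sketch of that loop-and-concatenate pattern (the glob pattern here is illustrative; substitute the actual file locations):

```python
from glob import glob

import pandas as pd

# Read every matching CSV into a list of DataFrames ...
frames = [pd.read_csv(f) for f in sorted(glob('summer_*.csv'))]

# ... then stack them row-wise; pd.concat() replaces the removed DataFrame.append()
combined = pd.concat(frames, axis=0, ignore_index=True)
```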
Case Study Explorations
Loading Olympic edition DataFrame
In this chapter, you’ll be using The Guardian’s Olympic medal dataset.
Your first task here is to prepare a DataFrame editions
from a tab-separated values (TSV) file.
Initially, editions
has 26 rows (one for each Olympic edition, i.e., a year in which the Olympics was held) and 7 columns: 'Edition'
, 'Bronze'
, 'Gold'
, 'Silver'
, 'Grand Total'
, 'City'
, and 'Country'
.
For the analysis that follows, you won’t need the overall medal counts, so you want to keep only the useful columns from editions
: 'Edition'
, 'Grand Total'
, City
, and Country
.
Instructions
- Read
file_path
into a DataFrame called editions
. The identifier file_path
has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - EDITIONS.tsv'
. You’ll have to use the option sep='\t'
because the file uses tabs to delimit fields (pd.read_csv()
expects commas by default). - Select only the columns
'Edition'
, 'Grand Total'
, 'City'
, and 'Country'
from editions
. - Print the final DataFrame
editions
in entirety (there are only 26 rows). This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
| editions = pd.read_csv(so_editions_file, sep='\t')
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]
editions.head()
|
| Edition | Grand Total | City | Country |
---|
0 | 1896 | 151 | Athens | Greece |
---|
1 | 1900 | 512 | Paris | France |
---|
2 | 1904 | 470 | St. Louis | United States |
---|
3 | 1908 | 804 | London | United Kingdom |
---|
4 | 1912 | 885 | Stockholm | Sweden |
---|
Loading IOC codes DataFrames
Your task here is to prepare a DataFrame ioc_codes
from a comma-separated values (CSV) file.
Initially, ioc_codes
has 200 rows (one for each country) and 3 columns: 'Country'
, 'NOC'
, & 'ISO code'
.
For the analysis that follows, you want to keep only the useful columns from ioc_codes: 'Country'
and 'NOC'
(the column 'NOC'
contains three-letter codes representing each country).
Instructions
- Read
file_path
into a DataFrame called ioc_codes
. The identifier file_path
has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - IOC COUNTRY CODES.csv'
. - Select only the columns
'Country'
and 'NOC'
from ioc_codes
. - Print the leading 5 and trailing 5 rows of the DataFrame
ioc_codes
(there are 200 rows in total). This has been done for you, so hit ‘Submit Answer’ to see the result!
1
2
3
| ioc_codes = pd.read_csv(so_ioc_codes_file)
ioc_codes = ioc_codes[['Country', 'NOC']]
ioc_codes.head()
|
| Country | NOC |
---|
0 | Afghanistan | AFG |
---|
1 | Albania | ALB |
---|
2 | Algeria | ALG |
---|
3 | American Samoa* | ASA |
---|
4 | Andorra | AND |
---|
Building medals DataFrame
Here, you’ll start with the DataFrame editions from the previous exercise.
You have a sequence of files summer_1896.csv, summer_1900.csv, …, summer_2008.csv, one for each Olympic edition (year).
You will build up a dictionary medals_dict with the Olympic editions (years) as keys and DataFrames as values.
The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of editions).
Once the dictionary of DataFrames is built up, you will combine the DataFrames using pd.concat().
Instructions
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| # Build a dictionary of DataFrames, one per Olympic edition: medals_dict
medals_dict = {}

for year in editions['Edition']:
    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]
    # Assign year to column 'Edition' of medals_dict[year]
    medals_dict[year]['Edition'] = year

# Concatenate medals_dict into a single DataFrame: medals
medals = pd.concat(medals_dict, ignore_index=True)
|
1
2
3
| medals = pd.read_csv(so_all_medalists_file, sep='\t', header=4)
medals = medals[['Athlete', 'NOC', 'Medal', 'Edition']]
medals.head()
|
| Athlete | NOC | Medal | Edition |
---|
0 | HAJOS, Alfred | HUN | Gold | 1896 |
---|
1 | HERSCHMANN, Otto | AUT | Silver | 1896 |
---|
2 | DRIVAS, Dimitrios | GRE | Bronze | 1896 |
---|
3 | MALOKINIS, Ioannis | GRE | Gold | 1896 |
---|
4 | CHASAPIS, Spiridon | GRE | Silver | 1896 |
---|
Constructing a pivot table
- Apply DataFrame pivot_table() method
- index: column to use as index of pivot table
- values: column(s) to aggregate
- aggfunc: function to apply for aggregation
- columns: categories as columns of pivot table (a minimal sketch follows this list)
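A minimal sketch of those four arguments on a toy, made-up medals table (not the case-study data):

```python
import pandas as pd

toy = pd.DataFrame({'Edition': [1896, 1896, 1900],
                    'NOC': ['GRE', 'HUN', 'FRA'],
                    'Athlete': ['A', 'B', 'C']})

# index='Edition' puts editions on the rows, columns='NOC' spreads countries
# across the columns, and aggfunc='count' counts the 'Athlete' entries in each cell
print(toy.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count'))
```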
Case Study Explorations
Counting medals by country/edition in a pivot table
Here, you’ll start with the concatenated DataFrame medals
from the previous exercise.
You can construct a pivot table to see the number of medals each country won in each year. The result is a new DataFrame with the Olympic edition on the Index and with 138 country NOC
codes as columns. If you want a refresher on pivot tables, it may be useful to refer back to the relevant exercises in Manipulating DataFrames with pandas.
Instructions
- Construct a pivot table from the DataFrame
medals
, aggregating by count
(by specifying the aggfunc
parameter). Use 'Edition'
as the index
, 'Athlete'
for the values
, and 'NOC'
for the columns
. - Print the first & last 5 rows of
medal_counts
. This has been done for you, so hit ‘Submit Answer’ to see the results!
1
2
3
4
5
| # Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')
# Print the first & last 5 rows of medal_counts
medal_counts.head()
|
NOC | AFG | AHO | ALG | ANZ | ARG | ARM | AUS | AUT | AZE | BAH | BAR | BDI | BEL | BER | BLR | BOH | BRA | BUL | BWI | CAN | CHI | CHN | CIV | CMR | COL | CRC | CRO | CUB | CZE | DEN | DJI | DOM | ECU | EGY | ERI | ESP | EST | ETH | EUA | EUN | FIN | FRA | FRG | GBR | GDR | GEO | GER | GHA | GRE | GUY | HAI | HKG | HUN | INA | IND | IOP | IRI | IRL | IRQ | ISL | ISR | ISV | ITA | JAM | JPN | KAZ | KEN | KGZ | KOR | KSA | KUW | LAT | LIB | LTU | LUX | MAR | MAS | MDA | MEX | MGL | MKD | MOZ | MRI | NAM | NED | NGR | NIG | NOR | NZL | PAK | PAN | PAR | PER | PHI | POL | POR | PRK | PUR | QAT | ROU | RSA | RU1 | RUS | SCG | SEN | SIN | SLO | SRB | SRI | SUD | SUI | SUR | SVK | SWE | SYR | TAN | TCH | TGA | THA | TJK | TOG | TPE | TRI | TUN | TUR | UAE | UGA | UKR | URS | URU | USA | UZB | VEN | VIE | YUG | ZAM | ZIM | ZZX |
---|
Edition | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
---|
1896 | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | 7.0 | NaN | NaN | 33.0 | NaN | 52.0 | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.0 | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 |
---|
1900 | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | 6.0 | NaN | NaN | NaN | NaN | 39.0 | NaN | NaN | 2.0 | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | 185.0 | NaN | 78.0 | NaN | NaN | 40.0 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20.0 | NaN | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 55.0 | NaN | NaN | NaN | NaN | NaN | NaN | 34.0 |
---|
1904 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | 13.0 | NaN | 2.0 | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 394.0 | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 |
---|
1908 | NaN | NaN | NaN | 19.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 31.0 | NaN | NaN | 5.0 | NaN | NaN | NaN | 51.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29.0 | 35.0 | NaN | 347.0 | NaN | NaN | 22.0 | NaN | 3.0 | NaN | NaN | NaN | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.0 | NaN | NaN | 44.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 98.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 63.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
1912 | NaN | NaN | NaN | 10.0 | NaN | NaN | NaN | 14.0 | NaN | NaN | NaN | NaN | 19.0 | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 84.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 67.0 | 25.0 | NaN | 160.0 | NaN | NaN | 52.0 | NaN | 2.0 | NaN | NaN | NaN | 30.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 21.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22.0 | NaN | NaN | 76.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 173.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 101.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
NOC | AFG | AHO | ALG | ANZ | ARG | ARM | AUS | AUT | AZE | BAH | BAR | BDI | BEL | BER | BLR | BOH | BRA | BUL | BWI | CAN | CHI | CHN | CIV | CMR | COL | CRC | CRO | CUB | CZE | DEN | DJI | DOM | ECU | EGY | ERI | ESP | EST | ETH | EUA | EUN | FIN | FRA | FRG | GBR | GDR | GEO | GER | GHA | GRE | GUY | HAI | HKG | HUN | INA | IND | IOP | IRI | IRL | IRQ | ISL | ISR | ISV | ITA | JAM | JPN | KAZ | KEN | KGZ | KOR | KSA | KUW | LAT | LIB | LTU | LUX | MAR | MAS | MDA | MEX | MGL | MKD | MOZ | MRI | NAM | NED | NGR | NIG | NOR | NZL | PAK | PAN | PAR | PER | PHI | POL | POR | PRK | PUR | QAT | ROU | RSA | RU1 | RUS | SCG | SEN | SIN | SLO | SRB | SRI | SUD | SUI | SUR | SVK | SWE | SYR | TAN | TCH | TGA | THA | TJK | TOG | TPE | TRI | TUN | TUR | UAE | UGA | UKR | URS | URU | USA | UZB | VEN | VIE | YUG | ZAM | ZIM | ZZX |
---|
Edition | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
---|
1992 | NaN | NaN | 2.0 | NaN | 2.0 | NaN | 57.0 | 6.0 | NaN | 1.0 | NaN | NaN | 3.0 | NaN | NaN | NaN | 14.0 | 17.0 | NaN | 44.0 | NaN | 83.0 | NaN | NaN | 1.0 | NaN | 15.0 | 71.0 | NaN | 14.0 | NaN | NaN | NaN | NaN | NaN | 66.0 | 3.0 | 3.0 | NaN | 223.0 | 7.0 | 57.0 | NaN | 50.0 | NaN | NaN | 198.0 | 13.0 | 2.0 | NaN | NaN | NaN | 45.0 | 6.0 | NaN | 3.0 | 3.0 | 2.0 | NaN | NaN | 2.0 | NaN | 46.0 | 4.0 | 47.0 | NaN | 8.0 | NaN | 49.0 | NaN | NaN | 3.0 | NaN | 13.0 | NaN | 3.0 | 2.0 | NaN | 1.0 | 2.0 | NaN | NaN | NaN | 2.0 | 33.0 | 11.0 | NaN | 23.0 | 15.0 | 16.0 | NaN | NaN | 1.0 | 1.0 | 42.0 | NaN | 10.0 | 1.0 | 1.0 | 53.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 35.0 | NaN | NaN | 8.0 | NaN | 1.0 | NaN | NaN | 20.0 | NaN | NaN | 6.0 | NaN | NaN | NaN | NaN | NaN | 224.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
1996 | NaN | NaN | 3.0 | NaN | 20.0 | 2.0 | 132.0 | 3.0 | 1.0 | 5.0 | NaN | 1.0 | 6.0 | NaN | 23.0 | NaN | 64.0 | 21.0 | NaN | 51.0 | NaN | 110.0 | NaN | NaN | NaN | 1.0 | 29.0 | 57.0 | 13.0 | 24.0 | NaN | NaN | 1.0 | NaN | NaN | 67.0 | NaN | 3.0 | NaN | NaN | 4.0 | 51.0 | NaN | 26.0 | NaN | 2.0 | 124.0 | NaN | 8.0 | NaN | NaN | 1.0 | 43.0 | 6.0 | 1.0 | NaN | 3.0 | 4.0 | NaN | NaN | 1.0 | NaN | 71.0 | 16.0 | 43.0 | 11.0 | 8.0 | NaN | 66.0 | NaN | NaN | 1.0 | NaN | 12.0 | NaN | 2.0 | 3.0 | 3.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | 2.0 | 73.0 | 26.0 | NaN | 25.0 | 9.0 | NaN | NaN | NaN | NaN | 1.0 | 21.0 | 3.0 | 5.0 | 1.0 | NaN | 38.0 | 5.0 | NaN | 115.0 | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | 11.0 | NaN | 3.0 | 31.0 | 1.0 | NaN | NaN | 1.0 | 2.0 | NaN | NaN | 1.0 | 2.0 | 1.0 | 6.0 | NaN | 1.0 | 34.0 | NaN | NaN | 260.0 | 2.0 | NaN | NaN | 26.0 | 1.0 | NaN | NaN |
---|
2000 | NaN | NaN | 5.0 | NaN | 20.0 | 1.0 | 183.0 | 4.0 | 3.0 | 6.0 | 1.0 | NaN | 7.0 | NaN | 22.0 | NaN | 48.0 | 13.0 | NaN | 31.0 | 18.0 | 79.0 | NaN | 18.0 | 1.0 | 2.0 | 10.0 | 69.0 | 9.0 | 25.0 | NaN | NaN | NaN | NaN | NaN | 43.0 | 3.0 | 8.0 | NaN | NaN | 5.0 | 66.0 | NaN | 55.0 | NaN | 6.0 | 119.0 | NaN | 18.0 | NaN | NaN | NaN | 53.0 | 8.0 | 1.0 | NaN | 4.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | 65.0 | 23.0 | 43.0 | 7.0 | 7.0 | 1.0 | 73.0 | 2.0 | 1.0 | 3.0 | NaN | 17.0 | NaN | 5.0 | NaN | 2.0 | 6.0 | NaN | 1.0 | 1.0 | NaN | NaN | 79.0 | 8.0 | NaN | 44.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | 24.0 | 2.0 | 4.0 | NaN | 1.0 | 46.0 | 5.0 | NaN | 188.0 | NaN | NaN | NaN | 3.0 | NaN | 1.0 | NaN | 14.0 | NaN | 6.0 | 32.0 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | 5.0 | 2.0 | NaN | 5.0 | NaN | NaN | 35.0 | NaN | 1.0 | 248.0 | 4.0 | NaN | 1.0 | 26.0 | NaN | NaN | NaN |
---|
2004 | NaN | NaN | NaN | NaN | 47.0 | NaN | 157.0 | 8.0 | 5.0 | 2.0 | NaN | NaN | 3.0 | NaN | 17.0 | NaN | 40.0 | 17.0 | NaN | 17.0 | 4.0 | 94.0 | NaN | 1.0 | 2.0 | NaN | 20.0 | 61.0 | 12.0 | 29.0 | NaN | 1.0 | NaN | 5.0 | 1.0 | 27.0 | 3.0 | 7.0 | NaN | NaN | 2.0 | 53.0 | NaN | 57.0 | NaN | 4.0 | 149.0 | NaN | 31.0 | NaN | NaN | 2.0 | 40.0 | 5.0 | 1.0 | NaN | 6.0 | NaN | NaN | NaN | 2.0 | NaN | 102.0 | 13.0 | 94.0 | 8.0 | 7.0 | NaN | 52.0 | NaN | NaN | 4.0 | NaN | 3.0 | NaN | 3.0 | NaN | NaN | 4.0 | 1.0 | NaN | NaN | NaN | NaN | 76.0 | 8.0 | NaN | 7.0 | 6.0 | NaN | NaN | 17.0 | NaN | NaN | 12.0 | 3.0 | 5.0 | NaN | NaN | 39.0 | 10.0 | NaN | 192.0 | 14.0 | NaN | NaN | 5.0 | NaN | NaN | NaN | 7.0 | NaN | 10.0 | 12.0 | 1.0 | NaN | NaN | NaN | 8.0 | NaN | NaN | 9.0 | 1.0 | NaN | 10.0 | 1.0 | NaN | 48.0 | NaN | NaN | 264.0 | 5.0 | 2.0 | NaN | NaN | NaN | 3.0 | NaN |
---|
2008 | 1.0 | NaN | 2.0 | NaN | 51.0 | 6.0 | 149.0 | 3.0 | 7.0 | 5.0 | NaN | NaN | 5.0 | NaN | 30.0 | NaN | 75.0 | 5.0 | NaN | 34.0 | 1.0 | 184.0 | NaN | 1.0 | 2.0 | NaN | 5.0 | 47.0 | 7.0 | 18.0 | NaN | 2.0 | 1.0 | 1.0 | NaN | 71.0 | 3.0 | 7.0 | NaN | NaN | 5.0 | 76.0 | NaN | 77.0 | NaN | 6.0 | 101.0 | NaN | 7.0 | NaN | NaN | NaN | 27.0 | 7.0 | 3.0 | NaN | 2.0 | 3.0 | NaN | 14.0 | 1.0 | NaN | 42.0 | 17.0 | 51.0 | 13.0 | 14.0 | 2.0 | 78.0 | NaN | NaN | 3.0 | NaN | 5.0 | NaN | 2.0 | 1.0 | 1.0 | 4.0 | 4.0 | NaN | NaN | 1.0 | NaN | 62.0 | 24.0 | NaN | 22.0 | 14.0 | NaN | 1.0 | NaN | NaN | NaN | 20.0 | 2.0 | 6.0 | NaN | NaN | 22.0 | 1.0 | NaN | 143.0 | NaN | NaN | 3.0 | 5.0 | 15.0 | NaN | 1.0 | 11.0 | NaN | 10.0 | 7.0 | NaN | NaN | NaN | NaN | 4.0 | 2.0 | 1.0 | 4.0 | 5.0 | 1.0 | 8.0 | NaN | NaN | 31.0 | NaN | NaN | 315.0 | 6.0 | 1.0 | 1.0 | NaN | NaN | 4.0 | NaN |
---|
Computing fraction of medals per Olympic edition
In this exercise, you’ll start with the DataFrames editions
, medals
, & medal_counts
from prior exercises.
You can extract a Series with the total number of medals awarded in each Olympic edition.
The DataFrame medal_counts
can be divided row-wise by the total number of medals awarded each edition; the method .divide()
performs the broadcast as you require.
This gives you a normalized indication of each country’s performance in each edition.
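A minimal sketch of that row-wise broadcast with .divide() (toy numbers, not the real medal counts):

```python
import pandas as pd

counts = pd.DataFrame({'GRE': [10.0, None], 'USA': [20.0, 40.0]}, index=[1896, 1900])
totals = pd.Series([30, 40], index=[1896, 1900])

# axis='rows' aligns totals on the row index, so every value in the 1896 row
# is divided by 30 and every value in the 1900 row by 40
print(counts.divide(totals, axis='rows'))
```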
Instructions
- Set the index of the DataFrame
editions
to be 'Edition'
(using the method .set_index()
). Save the result as totals
. - Extract the
'Grand Total'
column from totals
and assign the result back to totals
. - Divide the DataFrame
medal_counts
by totals
along each row. You will have to use the .divide()
method with the option axis='rows'
. Assign the result to fractions
. - Print first & last 5 rows of the DataFrame
fractions
. This has been done for you, so hit ‘Submit Answer’ to see the results!
1
2
3
| # Set Index of editions: totals
totals = editions.set_index('Edition')
totals.head()
|
| Grand Total | City | Country |
---|
Edition | | | |
---|
1896 | 151 | Athens | Greece |
---|
1900 | 512 | Paris | France |
---|
1904 | 470 | St. Louis | United States |
---|
1908 | 804 | London | United Kingdom |
---|
1912 | 885 | Stockholm | Sweden |
---|
1
2
3
| # Reassign totals['Grand Total']: totals
totals = totals['Grand Total']
totals.head()
|
1
2
3
4
5
6
7
| Edition
1896 151
1900 512
1904 470
1908 804
1912 885
Name: Grand Total, dtype: int64
|
1
2
3
4
5
| # Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')
# Print first & last 5 rows of fractions
fractions.head()
|
NOC | AFG | AHO | ALG | ANZ | ARG | ARM | AUS | AUT | AZE | BAH | BAR | BDI | BEL | BER | BLR | BOH | BRA | BUL | BWI | CAN | CHI | CHN | CIV | CMR | COL | CRC | CRO | CUB | CZE | DEN | DJI | DOM | ECU | EGY | ERI | ESP | EST | ETH | EUA | EUN | FIN | FRA | FRG | GBR | GDR | GEO | GER | GHA | GRE | GUY | HAI | HKG | HUN | INA | IND | IOP | IRI | IRL | IRQ | ISL | ISR | ISV | ITA | JAM | JPN | KAZ | KEN | KGZ | KOR | KSA | KUW | LAT | LIB | LTU | LUX | MAR | MAS | MDA | MEX | MGL | MKD | MOZ | MRI | NAM | NED | NGR | NIG | NOR | NZL | PAK | PAN | PAR | PER | PHI | POL | POR | PRK | PUR | QAT | ROU | RSA | RU1 | RUS | SCG | SEN | SIN | SLO | SRB | SRI | SUD | SUI | SUR | SVK | SWE | SYR | TAN | TCH | TGA | THA | TJK | TOG | TPE | TRI | TUN | TUR | UAE | UGA | UKR | URS | URU | USA | UZB | VEN | VIE | YUG | ZAM | ZIM | ZZX |
---|
Edition | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
---|
1896 | NaN | NaN | NaN | NaN | NaN | NaN | 0.013245 | 0.033113 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.039735 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.072848 | NaN | 0.046358 | NaN | NaN | 0.218543 | NaN | 0.344371 | NaN | NaN | NaN | 0.039735 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.019868 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.132450 | NaN | NaN | NaN | NaN | NaN | NaN | 0.039735 |
---|
1900 | NaN | NaN | NaN | NaN | NaN | NaN | 0.009766 | 0.011719 | NaN | NaN | NaN | NaN | 0.076172 | NaN | NaN | 0.003906 | NaN | NaN | NaN | 0.003906 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.003906 | NaN | 0.011719 | NaN | NaN | NaN | NaN | NaN | 0.003906 | NaN | NaN | NaN | NaN | NaN | 0.361328 | NaN | 0.152344 | NaN | NaN | 0.078125 | NaN | NaN | NaN | NaN | NaN | 0.009766 | NaN | 0.003906 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.007812 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.039062 | NaN | NaN | 0.017578 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.029297 | NaN | NaN | 0.001953 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.107422 | NaN | NaN | NaN | NaN | NaN | NaN | 0.066406 |
---|
1904 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.002128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.074468 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.019149 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.004255 | NaN | NaN | 0.027660 | NaN | 0.004255 | NaN | NaN | NaN | 0.008511 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.004255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.838298 | NaN | NaN | NaN | NaN | NaN | NaN | 0.017021 |
---|
1908 | NaN | NaN | NaN | 0.023632 | NaN | NaN | NaN | 0.001244 | NaN | NaN | NaN | NaN | 0.038557 | NaN | NaN | 0.006219 | NaN | NaN | NaN | 0.063433 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.018657 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.036070 | 0.043532 | NaN | 0.431592 | NaN | NaN | 0.027363 | NaN | 0.003731 | NaN | NaN | NaN | 0.022388 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.008706 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.013682 | NaN | NaN | 0.054726 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.002488 | 0.003731 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.121891 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.078358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
1912 | NaN | NaN | NaN | 0.011299 | NaN | NaN | NaN | 0.015819 | NaN | NaN | NaN | NaN | 0.021469 | NaN | NaN | NaN | NaN | NaN | NaN | 0.009040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.094915 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.075706 | 0.028249 | NaN | 0.180791 | NaN | NaN | 0.058757 | NaN | 0.002260 | NaN | NaN | NaN | 0.033898 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.023729 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.024859 | NaN | NaN | 0.085876 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.007910 | 0.015819 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.195480 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.114124 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
NOC | AFG | AHO | ALG | ANZ | ARG | ARM | AUS | AUT | AZE | BAH | BAR | BDI | BEL | BER | BLR | BOH | BRA | BUL | BWI | CAN | CHI | CHN | CIV | CMR | COL | CRC | CRO | CUB | CZE | DEN | DJI | DOM | ECU | EGY | ERI | ESP | EST | ETH | EUA | EUN | FIN | FRA | FRG | GBR | GDR | GEO | GER | GHA | GRE | GUY | HAI | HKG | HUN | INA | IND | IOP | IRI | IRL | IRQ | ISL | ISR | ISV | ITA | JAM | JPN | KAZ | KEN | KGZ | KOR | KSA | KUW | LAT | LIB | LTU | LUX | MAR | MAS | MDA | MEX | MGL | MKD | MOZ | MRI | NAM | NED | NGR | NIG | NOR | NZL | PAK | PAN | PAR | PER | PHI | POL | POR | PRK | PUR | QAT | ROU | RSA | RU1 | RUS | SCG | SEN | SIN | SLO | SRB | SRI | SUD | SUI | SUR | SVK | SWE | SYR | TAN | TCH | TGA | THA | TJK | TOG | TPE | TRI | TUN | TUR | UAE | UGA | UKR | URS | URU | USA | UZB | VEN | VIE | YUG | ZAM | ZIM | ZZX |
---|
Edition | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
---|
1992 | NaN | NaN | 0.001173 | NaN | 0.001173 | NaN | 0.033431 | 0.003519 | NaN | 0.000587 | NaN | NaN | 0.001760 | NaN | NaN | NaN | 0.008211 | 0.009971 | NaN | 0.025806 | NaN | 0.048680 | NaN | NaN | 0.000587 | NaN | 0.008798 | 0.041642 | NaN | 0.008211 | NaN | NaN | NaN | NaN | NaN | 0.038710 | 0.001760 | 0.001760 | NaN | 0.130792 | 0.004106 | 0.033431 | NaN | 0.029326 | NaN | NaN | 0.116129 | 0.007625 | 0.001173 | NaN | NaN | NaN | 0.026393 | 0.003519 | NaN | 0.00176 | 0.001760 | 0.001173 | NaN | NaN | 0.001173 | NaN | 0.026979 | 0.002346 | 0.027566 | NaN | 0.004692 | NaN | 0.028739 | NaN | NaN | 0.001760 | NaN | 0.007625 | NaN | 0.001760 | 0.001173 | NaN | 0.000587 | 0.001173 | NaN | NaN | NaN | 0.001173 | 0.019355 | 0.006452 | NaN | 0.013490 | 0.008798 | 0.009384 | NaN | NaN | 0.000587 | 0.000587 | 0.024633 | NaN | 0.005865 | 0.000587 | 0.000587 | 0.031085 | 0.001760 | NaN | NaN | NaN | NaN | NaN | 0.003519 | NaN | NaN | NaN | 0.000587 | 0.000587 | NaN | 0.020528 | NaN | NaN | 0.004692 | NaN | 0.000587 | NaN | NaN | 0.011730 | NaN | NaN | 0.003519 | NaN | NaN | NaN | NaN | NaN | 0.131378 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
---|
1996 | NaN | NaN | 0.001614 | NaN | 0.010758 | 0.001076 | 0.071006 | 0.001614 | 0.000538 | 0.002690 | NaN | 0.000538 | 0.003228 | NaN | 0.012372 | NaN | 0.034427 | 0.011296 | NaN | 0.027434 | NaN | 0.059172 | NaN | NaN | NaN | 0.000538 | 0.015600 | 0.030662 | 0.006993 | 0.012910 | NaN | NaN | 0.000538 | NaN | NaN | 0.036041 | NaN | 0.001614 | NaN | NaN | 0.002152 | 0.027434 | NaN | 0.013986 | NaN | 0.001076 | 0.066703 | NaN | 0.004303 | NaN | NaN | 0.000538 | 0.023131 | 0.003228 | 0.000538 | NaN | 0.001614 | 0.002152 | NaN | NaN | 0.000538 | NaN | 0.038193 | 0.008607 | 0.023131 | 0.005917 | 0.004303 | NaN | 0.035503 | NaN | NaN | 0.000538 | NaN | 0.006455 | NaN | 0.001076 | 0.001614 | 0.001614 | 0.000538 | 0.000538 | NaN | 0.000538 | NaN | 0.001076 | 0.039268 | 0.013986 | NaN | 0.013448 | 0.004841 | NaN | NaN | NaN | NaN | 0.000538 | 0.011296 | 0.001614 | 0.002690 | 0.000538 | NaN | 0.020441 | 0.002690 | NaN | 0.061861 | NaN | NaN | NaN | 0.001076 | NaN | NaN | NaN | 0.005917 | NaN | 0.001614 | 0.016676 | 0.000538 | NaN | NaN | 0.000538 | 0.001076 | NaN | NaN | 0.000538 | 0.001076 | 0.000538 | 0.003228 | NaN | 0.000538 | 0.018289 | NaN | NaN | 0.139860 | 0.001076 | NaN | NaN | 0.013986 | 0.000538 | NaN | NaN |
---|
2000 | NaN | NaN | 0.002481 | NaN | 0.009926 | 0.000496 | 0.090819 | 0.001985 | 0.001489 | 0.002978 | 0.000496 | NaN | 0.003474 | NaN | 0.010918 | NaN | 0.023821 | 0.006452 | NaN | 0.015385 | 0.008933 | 0.039206 | NaN | 0.008933 | 0.000496 | 0.000993 | 0.004963 | 0.034243 | 0.004467 | 0.012407 | NaN | NaN | NaN | NaN | NaN | 0.021340 | 0.001489 | 0.003970 | NaN | NaN | 0.002481 | 0.032754 | NaN | 0.027295 | NaN | 0.002978 | 0.059057 | NaN | 0.008933 | NaN | NaN | NaN | 0.026303 | 0.003970 | 0.000496 | NaN | 0.001985 | 0.000496 | NaN | 0.000496 | 0.000496 | NaN | 0.032258 | 0.011414 | 0.021340 | 0.003474 | 0.003474 | 0.000496 | 0.036228 | 0.000993 | 0.000496 | 0.001489 | NaN | 0.008437 | NaN | 0.002481 | NaN | 0.000993 | 0.002978 | NaN | 0.000496 | 0.000496 | NaN | NaN | 0.039206 | 0.003970 | NaN | 0.021836 | 0.001985 | NaN | NaN | NaN | NaN | NaN | 0.011911 | 0.000993 | 0.001985 | NaN | 0.000496 | 0.022829 | 0.002481 | NaN | 0.093300 | NaN | NaN | NaN | 0.001489 | NaN | 0.000496 | NaN | 0.006948 | NaN | 0.002978 | 0.015881 | NaN | NaN | NaN | NaN | 0.001489 | NaN | NaN | 0.002481 | 0.000993 | NaN | 0.002481 | NaN | NaN | 0.017370 | NaN | 0.000496 | 0.123077 | 0.001985 | NaN | 0.000496 | 0.012903 | NaN | NaN | NaN |
---|
2004 | NaN | NaN | NaN | NaN | 0.023524 | NaN | 0.078579 | 0.004004 | 0.002503 | 0.001001 | NaN | NaN | 0.001502 | NaN | 0.008509 | NaN | 0.020020 | 0.008509 | NaN | 0.008509 | 0.002002 | 0.047047 | NaN | 0.000501 | 0.001001 | NaN | 0.010010 | 0.030531 | 0.006006 | 0.014515 | NaN | 0.000501 | NaN | 0.002503 | 0.000501 | 0.013514 | 0.001502 | 0.003504 | NaN | NaN | 0.001001 | 0.026527 | NaN | 0.028529 | NaN | 0.002002 | 0.074575 | NaN | 0.015516 | NaN | NaN | 0.001001 | 0.020020 | 0.002503 | 0.000501 | NaN | 0.003003 | NaN | NaN | NaN | 0.001001 | NaN | 0.051051 | 0.006507 | 0.047047 | 0.004004 | 0.003504 | NaN | 0.026026 | NaN | NaN | 0.002002 | NaN | 0.001502 | NaN | 0.001502 | NaN | NaN | 0.002002 | 0.000501 | NaN | NaN | NaN | NaN | 0.038038 | 0.004004 | NaN | 0.003504 | 0.003003 | NaN | NaN | 0.008509 | NaN | NaN | 0.006006 | 0.001502 | 0.002503 | NaN | NaN | 0.019520 | 0.005005 | NaN | 0.096096 | 0.007007 | NaN | NaN | 0.002503 | NaN | NaN | NaN | 0.003504 | NaN | 0.005005 | 0.006006 | 0.000501 | NaN | NaN | NaN | 0.004004 | NaN | NaN | 0.004505 | 0.000501 | NaN | 0.005005 | 0.000501 | NaN | 0.024024 | NaN | NaN | 0.132132 | 0.002503 | 0.001001 | NaN | NaN | NaN | 0.001502 | NaN |
---|
2008 | 0.00049 | NaN | 0.000979 | NaN | 0.024976 | 0.002938 | 0.072968 | 0.001469 | 0.003428 | 0.002449 | NaN | NaN | 0.002449 | NaN | 0.014691 | NaN | 0.036729 | 0.002449 | NaN | 0.016650 | 0.000490 | 0.090108 | NaN | 0.000490 | 0.000979 | NaN | 0.002449 | 0.023017 | 0.003428 | 0.008815 | NaN | 0.000979 | 0.000490 | 0.000490 | NaN | 0.034770 | 0.001469 | 0.003428 | NaN | NaN | 0.002449 | 0.037218 | NaN | 0.037708 | NaN | 0.002938 | 0.049461 | NaN | 0.003428 | NaN | NaN | NaN | 0.013222 | 0.003428 | 0.001469 | NaN | 0.000979 | 0.001469 | NaN | 0.006856 | 0.000490 | NaN | 0.020568 | 0.008325 | 0.024976 | 0.006366 | 0.006856 | 0.000979 | 0.038198 | NaN | NaN | 0.001469 | NaN | 0.002449 | NaN | 0.000979 | 0.000490 | 0.000490 | 0.001959 | 0.001959 | NaN | NaN | 0.00049 | NaN | 0.030362 | 0.011753 | NaN | 0.010774 | 0.006856 | NaN | 0.00049 | NaN | NaN | NaN | 0.009794 | 0.000979 | 0.002938 | NaN | NaN | 0.010774 | 0.000490 | NaN | 0.070029 | NaN | NaN | 0.001469 | 0.002449 | 0.007346 | NaN | 0.00049 | 0.005387 | NaN | 0.004897 | 0.003428 | NaN | NaN | NaN | NaN | 0.001959 | 0.000979 | 0.00049 | 0.001959 | 0.002449 | 0.000490 | 0.003918 | NaN | NaN | 0.015181 | NaN | NaN | 0.154261 | 0.002938 | 0.000490 | 0.000490 | NaN | NaN | 0.001959 | NaN |
---|
Computing percentage change in fraction of medals won
Here, you’ll start with the DataFrames editions
, medals
, medal_counts
, & fractions
from prior exercises.
To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.
The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time. If you are interested in learning more about pandas’ expanding transformations, this section of the pandas documentation has additional information.
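A minimal sketch of the two steps on a toy fraction series (made-up numbers):

```python
import pandas as pd

fractions = pd.Series([0.10, 0.20, 0.30], index=[1896, 1900, 1904])

# Expanding mean: at each edition, the mean of all fractions seen so far
running_mean = fractions.expanding().mean()      # 0.10, 0.15, 0.20

# Percentage change of that running mean from one edition to the next
print(running_mean.pct_change() * 100)           # NaN, 50.0, 33.33...
```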
Instructions
- Create
mean_fractions
by chaining the methods .expanding().mean()
to fractions
. - Compute the percentage change in
mean_fractions
down each column by applying .pct_change()
and multiplying by 100
. Assign the result to fractions_change
. - Reset the index of
fractions_change
using the .reset_index()
method. This will make 'Edition'
an ordinary column. - Print the first and last 5 rows of the DataFrame
fractions_change
. This has been done for you, so hit ‘Submit Answer’ to see the results!
1
2
3
| # Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()
mean_fractions.head()
|
(Output: first 5 rows of mean_fractions — indexed by Edition (1896-1912), with one column per NOC (138 countries). Most entries are NaN because few countries competed in the early editions; for 1896, e.g., FRA 0.072848, GBR 0.046358, GER 0.218543, GRE 0.344371, USA 0.132450.)
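If the expanding mean is unfamiliar, the following toy sketch (made-up values, not from the medals data, and assuming the pandas import at the top of the document) shows what .expanding().mean() computes: at each edition, it averages every value up to and including that edition.
1
2
3
4
5
6
| # Toy Series: hypothetical medal fractions for one country across five editions
s = pd.Series([0.10, 0.20, 0.00, 0.40, 0.30],
              index=[1896, 1900, 1904, 1908, 1912])
# Expanding mean = average of all values seen so far:
# 1896 -> 0.100, 1900 -> 0.150, 1904 -> 0.100, 1908 -> 0.175, 1912 -> 0.200
print(s.expanding().mean())
|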
1
2
3
| # Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change()*100
fractions_change.head()
|
(Output: first 5 rows of fractions_change — the same Edition-by-NOC layout. The 1896 row is all NaN since there is no earlier edition to compare against; for 1900, e.g., FRA 198.002486, GBR 114.313616, GRE 0.000000, USA -9.448242.)
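Similarly, a quick toy sketch of .pct_change() (made-up values): each row is compared with the previous row in the same column, which is why the first row is always NaN.
1
2
3
4
5
| # Toy Series of expanding means (made-up values)
m = pd.Series([0.10, 0.15, 0.10], index=[1896, 1900, 1904])
# Row-over-row percentage change, scaled to percent:
# 1896 -> NaN, 1900 -> 50.0 ((0.15 - 0.10) / 0.10 * 100), 1904 -> -33.33...
print(m.pct_change() * 100)
|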
1
2
3
4
5
| # Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()
# Print first & last 5 rows of fractions_change
fractions_change.head()
|
(Output: first 5 rows of fractions_change after .reset_index() — Edition is now an ordinary column, the rows are labelled 0-4, and the 138 NOC columns keep the same values as the table above.)
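For completeness, here is a toy view (hypothetical frame) of what .reset_index() does at this step: the Edition index becomes an ordinary column and a fresh 0-based index takes its place.
1
2
3
4
| # Toy frame indexed by Edition (hypothetical values)
df = pd.DataFrame({'FRA': [1.0, 2.0]}, index=pd.Index([1896, 1900], name='Edition'))
# After reset_index(), 'Edition' is a normal column and the index becomes 0, 1
print(df.reset_index())
|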
1
| fractions_change.tail()
|
(Output: last 5 rows of fractions_change — editions 1992-2008 across the 138 NOC columns. For CHN, the values are 4.240630 (1992), 7.860247 (1996), -3.851278 (2000), 0.128863 (2004) and 13.251332 (2008), which reappear later in the chn DataFrame.)
Reshaping and plotting
Case Study Explorations
Building hosts DataFrame
Your task here is to prepare a DataFrame hosts by left joining editions and ioc_codes. Once created, you will subset the Edition and NOC columns and set Edition as the Index. There are some missing NOC values; you will set those explicitly. Finally, you'll reset the Index and print the final DataFrame.
Instructions
- Create the DataFrame hosts by doing a left join on DataFrames editions and ioc_codes (using pd.merge()).
- Clean up hosts by subsetting and setting the Index:
  - Extract the columns 'Edition' and 'NOC'.
  - Set the 'Edition' column as the Index.
- Use the .loc[] accessor to find and assign the missing values to the 'NOC' column in hosts. This has been done for you.
- Reset the index of hosts using .reset_index(), which you'll need to save as the hosts DataFrame.
1
2
3
| # Left join editions and ioc_codes: hosts
hosts = pd.merge(editions, ioc_codes, how='left')
hosts.head()
|
| | Edition | Grand Total | City | Country | NOC |
|---|---|---|---|---|---|
| 0 | 1896 | 151 | Athens | Greece | GRE |
| 1 | 1900 | 512 | Paris | France | FRA |
| 2 | 1904 | 470 | St. Louis | United States | USA |
| 3 | 1908 | 804 | London | United Kingdom | GBR |
| 4 | 1912 | 885 | Stockholm | Sweden | SWE |
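A note on how this join works: since no on= key is passed, pd.merge() joins on whatever column name(s) the two DataFrames share (here that appears to be 'Country'), and how='left' keeps every row of editions even where no IOC code matches. A minimal sketch with toy frames (hypothetical data, including the made-up country 'Atlantis'):
1
2
3
4
5
| # Toy left join: every row of the left frame is kept
left = pd.DataFrame({'Country': ['Greece', 'Atlantis'], 'Edition': [1896, 1900]})
right = pd.DataFrame({'Country': ['Greece', 'France'], 'NOC': ['GRE', 'FRA']})
# Joins on the shared 'Country' column; 'Atlantis' has no match, so its NOC is NaN
print(pd.merge(left, right, how='left'))
|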
1
2
3
| # Extract relevant columns and set index: hosts
hosts = hosts[['Edition', 'NOC']].set_index('Edition')
hosts.head()
|
| Edition | NOC |
|---|---|
| 1896 | GRE |
| 1900 | FRA |
| 1904 | USA |
| 1908 | GBR |
| 1912 | SWE |
1
2
| # Find the missing 'NOC' values of hosts
hosts.loc[hosts.NOC.isnull()]
|
| Edition | NOC |
|---|---|
| 1972 | NaN |
| 1980 | NaN |
| 1988 | NaN |
1
2
3
4
5
6
| # 1972 Munich: the host was West Germany (FRG)
hosts.loc[1972, 'NOC'] = 'FRG'
# 1980 Moscow: the host was the Soviet Union (URS)
hosts.loc[1980, 'NOC'] = 'URS'
# 1988 Seoul: the host was South Korea (KOR)
hosts.loc[1988, 'NOC'] = 'KOR'
|
1
2
3
| # Reset Index of hosts: hosts
hosts.reset_index(inplace=True)
hosts.head()
|
| | Edition | NOC |
|---|---|---|
| 0 | 1896 | GRE |
| 1 | 1900 | FRA |
| 2 | 1904 | USA |
| 3 | 1908 | GBR |
| 4 | 1912 | SWE |
Reshaping for analysis
This exercise starts off with fractions_change and hosts already loaded. Your task here is to reshape the fractions_change DataFrame for later analysis.
Initially, fractions_change is a wide DataFrame of 26 rows (one for each Olympic edition) and 139 columns (one for the edition and 138 for the competing countries).
On reshaping with pd.melt(), as you will see, the result is a tall DataFrame with 3588 rows and 3 columns that summarizes the fractional change in the expanding mean of the percentage of medals won for each country in blocks.
Instructions
- Create a DataFrame reshaped by reshaping the DataFrame fractions_change with pd.melt().
- You'll need to use the keyword argument id_vars='Edition' to set the identifier variable.
- You'll also need to use the keyword argument value_name='Change' to name the measured variable.
- Print the shape of the DataFrames reshaped and fractions_change. This has been done for you.
- Create a DataFrame chn by extracting all the rows from reshaped in which the three-letter code for each country ('NOC') is 'CHN'.
- Print the last 5 rows of the DataFrame chn using the .tail() method.
1
2
3
4
5
| # Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change, id_vars='Edition', value_name='Change')
# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)
|
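If the wide-to-tall reshape is hard to picture, here is a minimal pd.melt() sketch on a toy frame (made-up values). In the exercise the columns axis of fractions_change is already named 'NOC', so melt picks that name up automatically; the sketch passes var_name explicitly for clarity.
1
2
3
4
5
6
| # Toy wide frame: one row per edition, one column per country (made-up values)
wide = pd.DataFrame({'Edition': [1896, 1900], 'FRA': [1.0, 2.0], 'USA': [3.0, 4.0]})
# var_name is passed explicitly here; in the exercise it is implied by the
# columns name 'NOC' of fractions_change
tall = pd.melt(wide, id_vars='Edition', var_name='NOC', value_name='Change')
print(tall)  # 4 rows of (Edition, NOC, Change)
|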
1
2
3
4
5
| # Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped.NOC == 'CHN']
# Print last 5 rows of chn with .tail()
chn.tail()
|
| | Edition | NOC | Change |
|---|---|---|---|
| 567 | 1992 | CHN | 4.240630 |
| 568 | 1996 | CHN | 7.860247 |
| 569 | 2000 | CHN | -3.851278 |
| 570 | 2004 | CHN | 0.128863 |
| 571 | 2008 | CHN | 13.251332 |
Looking at the host countries of the last 5 Olympic editions alongside the fractional change in medals won by China over those editions, you can see that China fared significantly better in 2008, when China was the host country.
Merging to compute influence
This exercise starts off with the DataFrames reshaped and hosts in the namespace. Your task is to merge the two DataFrames and tidy the result.
The end result is a DataFrame summarizing the fractional change in the expanding mean of the percentage of medals won for the host country in each Olympic edition.
Instructions
- Merge reshaped and hosts using an inner join. Remember, how='inner' is the default behavior for pd.merge().
- Print the first 5 rows of the DataFrame merged. This has been done for you. You should see that the rows are jumbled chronologically.
- Set the index of merged to be 'Edition' and sort the index.
- Print the first 5 rows of the DataFrame influence.
1
2
3
4
| # Merge reshaped and hosts: merged
merged = pd.merge(reshaped, hosts, how='inner')
# Print first 5 rows of merged
merged.head()
|
| | Edition | NOC | Change |
|---|---|---|---|
| 0 | 1956 | AUS | 54.615063 |
| 1 | 2000 | AUS | 12.554986 |
| 2 | 1920 | BEL | 54.757887 |
| 3 | 1976 | CAN | -2.143977 |
| 4 | 2008 | CHN | 13.251332 |
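Why does the inner join collapse reshaped down to the host-country rows? Only (Edition, NOC) pairs that appear in both reshaped and hosts survive. A toy sketch (hypothetical rows):
1
2
3
4
5
| # Toy inner join: only (Edition, NOC) pairs present in both frames are kept
reshaped_toy = pd.DataFrame({'Edition': [1896, 1896], 'NOC': ['GRE', 'USA'], 'Change': [10.0, 5.0]})
hosts_toy = pd.DataFrame({'Edition': [1896], 'NOC': ['GRE']})
# how='inner' is the default; the USA row drops out because Greece hosted 1896
print(pd.merge(reshaped_toy, hosts_toy, how='inner'))
|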
1
2
3
4
5
| # Set Index of merged and sort it: influence
influence = merged.set_index('Edition').sort_index()
# Print first 5 rows of influence
influence.head()
|
| Edition | NOC | Change |
|---|---|---|
| 1896 | GRE | NaN |
| 1900 | FRA | 198.002486 |
| 1904 | USA | 199.651245 |
| 1908 | GBR | 134.489218 |
| 1912 | SWE | 71.896226 |
Plotting influence of host country
This final exercise starts off with the DataFrames influence and editions in the namespace. Your job is to plot the influence of being a host country.
Instructions
- Create a Series called change by extracting the 'Change' column from influence.
- Create a bar plot of change using the .plot() method with kind='bar'. Save the result as ax to permit further customization.
- Customize the bar plot of change to improve readability:
  - Apply the method .set_ylabel("% Change of Host Country Medal Count") to ax.
  - Apply the method .set_title("Is there a Host Country Advantage?") to ax.
  - Apply the method .set_xticklabels(editions['City']) to ax.
- Reveal the final plot using plt.show().
1
2
3
4
5
6
7
8
9
10
11
12
13
| # Extract influence['Change']: change
change = influence['Change']
# Make bar plot of change: ax
ax = change.plot(kind='bar')
# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])
# Display the plot
plt.show()
|
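With 26 bars, the city labels can crowd together. One optional readability tweak (not part of the exercise) is to rotate them before calling plt.show():
1
2
3
| # Optional tweak (not in the exercise): rotate the city labels so they don't overlap
ax.set_xticklabels(editions['City'], rotation=90, fontsize=8)
plt.tight_layout()
|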
Certificate