Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint as pp
import csv
from pathlib import Path

Pandas Configuration Options

In [2]:
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)

Data Files Location

Data File Objects

In [3]:
data = Path.cwd() / 'data' / 'merging-dataframes-with-pandas'
auto_fuel_file = data / 'auto_fuel_efficiency.csv'
baby_1881_file = data / 'baby_names1881.csv'
baby_1981_file = data / 'baby_names1981.csv'
exch_rates_file = data / 'exchange_rates.csv'
gdp_china_file = data / 'gdp_china.csv'
gdp_usa_file = data / 'gdp_usa.csv'
oil_price_file = data / 'oil_price.csv'
pitts_file = data / 'pittsburgh_weather_2013.csv'
sales_feb_hardware_file = data / 'sales-feb-Hardware.csv'
sales_feb_service_file = data / 'sales-feb-Service.csv'
sales_feb_software_file = data / 'sales-feb-Software.csv'
sales_jan_2015_file = data / 'sales-jan-2015.csv'
sales_feb_2015_file = data / 'sales-feb-2015.csv'
sales_mar_2015_file = data / 'sales-mar-2015.csv'
sp500_file = data / 'sp500.csv'
so_bronze_file = data / 'summer_olympics_Bronze.csv'
so_bronze5_file = data / 'summer_olympics_bronze_top5.csv'
so_gold_file = data / 'summer_olympics_Gold.csv'
so_gold5_file = data / 'summer_olympics_gold_top5.csv'
so_silver_file = data / 'summer_olympics_Silver.csv'
so_silver5_file = data / 'summer_olympics_silver_top5.csv'
so_all_medalists_file = data / 'summer_olympics_medalists 1896 to 2008 - ALL MEDALISTS.tsv'
so_editions_file = data / 'summer_olympics_medalists 1896 to 2008 - EDITIONS.tsv'
so_ioc_codes_file = data / 'summer_olympics_medalists 1896 to 2008 - IOC COUNTRY CODES.csv'

Merging DataFrames with pandas

Course Description

As a Data Scientist, you'll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist's toolbox. You'll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

Preparing Data

In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course.

Reading multiple data files

Tools for pandas data import

  • pd.read_csv() for CSV files
    • dataframe = pd.read_csv(filepath)
    • dozens of optional input parameters
  • Other data import tools:
    • pd.read_excel()
    • pd.read_html()
    • pd.read_json()

Loading separate files

import pandas as pd
dataframe0 = pd.read_csv('sales-jan-2015.csv')
dataframe1 = pd.read_csv('sales-feb-2015.csv')

Using a loop

filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

Using a comprehension

filenames = ['sales-jan-2015.csv', 'sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]

Using glob

from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]

Exercises

Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' read_csv() (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the Guardian.

The column labels of each DataFrame are NOC, Country, & Total where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).

Instructions

  • Import pandas as pd.
  • Read the file 'Bronze.csv' into a DataFrame called bronze.
  • Read the file 'Silver.csv' into a DataFrame called silver.
  • Read the file 'Gold.csv' into a DataFrame called gold.
  • Print the first 5 rows of the DataFrame gold. This has been done for you, so hit 'Submit Answer' to see the results.
In [4]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(so_bronze_file)

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(so_silver_file)

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(so_gold_file)

# Print the first five rows of gold
gold.head()
Out[4]:
NOC Country Total
0 USA United States 2088.0
1 URS Soviet Union 838.0
2 GBR United Kingdom 498.0
3 FRA France 378.0
4 GER Germany 407.0

Reading DataFrames from multiple files in a loop

As you saw in the video, loading data from multiple files into DataFrames is more efficient in a loop or a list comprehension.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

Here, you'll continue working with The Guardian's Olympic medal dataset.

Instructions

  • Create a list of file names called filenames with three strings 'Gold.csv', 'Silver.csv', & 'Bronze.csv'. This has been done for you.
  • Use a for loop to create another list called dataframes containing the three DataFrames loaded from filenames:
    • Iterate over filenames.
    • Read each CSV file in filenames into a DataFrame and append it to dataframes by using pd.read_csv() inside a call to .append().
  • Print the first 5 rows of the first DataFrame of the list dataframes. This has been done for you, so hit 'Submit Answer' to see the results.
In [5]:
# Create the list of file names: filenames
filenames = [so_bronze_file, so_silver_file, so_gold_file]

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
dataframes[0].head()
Out[5]:
NOC Country Total
0 USA United States 1052.0
1 URS Soviet Union 584.0
2 GBR United Kingdom 505.0
3 FRA France 475.0
4 GER Germany 454.0

Combining DataFrames from multiple data files

In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

Instructions

  • Construct a copy of the DataFrame gold called medals using the .copy() method.
  • Create a list called new_labels with entries 'NOC', 'Country', & 'Gold'. This is the same as the column labels from gold with the column label 'Total' replaced by 'Gold'.
  • Rename the columns of medals by assigning new_labels to medals.columns.
  • Create new columns 'Silver' and 'Bronze' in medals using silver['Total'] & bronze['Total'].
  • Print the top 5 rows of the final DataFrame medals. This has been done for you, so hit 'Submit Answer' to see the result!
In [6]:
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

# Print the head of medals
medals.head()
Out[6]:
NOC Country Gold Silver Bronze
0 USA United States 2088.0 1195.0 1052.0
1 URS Soviet Union 838.0 627.0 584.0
2 GBR United Kingdom 498.0 591.0 505.0
3 FRA France 378.0 461.0 475.0
4 GER Germany 407.0 350.0 454.0
In [7]:
del bronze, silver, gold, dataframes, medals

Reindexing DataFrames

"Indexes" vs. "Indices"

  • indices: many index labels within Index data structures
  • indexes: many pandas Index data structures

Importing weather data

import pandas as pd
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
w_max = pd.read_csv('quarterly_max_temp.csv', index_col='Month')

Examining the data

print(w_mean)
        Mean TemperatureF
Month
Apr     61.956044
Jan     32.133333
Jul     68.934783
Oct     43.434783

print(w_max)
        Max TemperatureF
Month
Jan     68
Apr     89
Jul     91
Oct     84

The DataFrame indexes

print(w_mean.index)
Index(['Apr', 'Jan', 'Jul', 'Oct'], dtype='object', name='Month')

print(w_max.index)
Index(['Jan', 'Apr', 'Jul', 'Oct'], dtype='object', name='Month')

print(type(w_mean.index))
<class 'pandas.indexes.base.Index'>

Using .reindex()

ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)

        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783

Using .sort_index()

w_mean2.sort_index()
        Mean TemperatureF
Month
Apr     61.956044
Jan     32.133333
Jul     68.934783
Oct     43.434783

Reindex from a DataFrame Index

w_mean.reindex(w_max.index)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783

Reindexing with missing labels

w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Dec'])
print(w_mean3)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Dec     NaN

Reindex from a DataFrame Index

w_max.reindex(w_mean3.index)
        Max TemperatureF
Month
Jan     68.0
Apr     89.0
Dec     NaN

w_max.reindex(w_mean3.index).dropna()
        Max TemperatureF
Month
Jan     68.0
Apr     89.0

Order matters

w_max.reindex(w_mean.index)
        Max TemperatureF
Month
Apr     89
Jan     68
Jul     91
Oct     84

w_mean.reindex(w_max.index)
        Mean TemperatureF
Month
Jan     32.133333
Apr     61.956044
Jul     68.934783
Oct     43.434783

Exercises

Sorting DataFrame with the Index & columns

It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are .sort_index() and .sort_values().

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.

Instructions

  • Read 'monthly_max_temp.csv' into a DataFrame called weather1 with 'Month' as the index.
  • Sort the index of weather1 in alphabetical order using the .sort_index() method and store the result in weather2.
  • Sort the index of weather1 in reverse alphabetical order by specifying the additional keyword argument ascending=False inside .sort_index().
  • Use the .sort_values() method to sort weather1 in increasing numerical order according to the values of the column 'Max TemperatureF'.
In [8]:
monthly_max_temp = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
                    'Max TemperatureF': [68, 60, 68, 84, 88, 89, 91, 86, 90, 84, 72, 68]}
In [9]:
# Read 'monthly_max_temp.csv' into a DataFrame: weather1
# weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values(by='Max TemperatureF')

# Print the head of weather4
print(weather4.head())
       Max TemperatureF
Month                  
Jan                  68
Feb                  60
Mar                  68
Apr                  84
May                  88
       Max TemperatureF
Month                  
Apr                  84
Aug                  86
Dec                  68
Feb                  60
Jan                  68
       Max TemperatureF
Month                  
Sep                  90
Oct                  84
Nov                  72
May                  88
Mar                  68
       Max TemperatureF
Month                  
Feb                  60
Jan                  68
Mar                  68
Dec                  68
Nov                  72

Reindexing DataFrame from a list

Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex() method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called weather1 and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

You'll initially use a list of all twelve month abbreviations and subsequently apply the .ffill() method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year.

Instructions

  • Reorder the rows of weather1 using the .reindex() method with the list year as the argument, which contains the abbreviations for each month.
  • Reorder the rows of weather1 just as you did above, this time chaining the .ffill() method to replace the null values with the last preceding non-null value.
In [10]:
monthly_max_temp = {'Month': ['Jan', 'Apr', 'Jul', 'Oct'],
                    'Max TemperatureF': [32.13333, 61.956044, 68.934783, 43.434783]}
weather1 = pd.DataFrame.from_dict(monthly_max_temp)
weather1.set_index('Month', inplace=True)
weather1
Out[10]:
Max TemperatureF
Month
Jan 32.133330
Apr 61.956044
Jul 68.934783
Oct 43.434783
In [11]:
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
weather2
Out[11]:
Max TemperatureF
Month
Jan 32.133330
Feb NaN
Mar NaN
Apr 61.956044
May NaN
Jun NaN
Jul 68.934783
Aug NaN
Sep NaN
Oct 43.434783
Nov NaN
Dec NaN
In [12]:
# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
weather3
Out[12]:
Max TemperatureF
Month
Jan 32.133330
Feb 32.133330
Mar 32.133330
Apr 61.956044
May 61.956044
Jun 61.956044
Jul 68.934783
Aug 68.934783
Sep 68.934783
Oct 43.434783
Nov 43.434783
Dec 43.434783

Reindexing DataFrame using another DataFrame Index

Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.

The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.

The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:

names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.

Instructions

  • Create a new DataFrame common_names by reindexing names_1981 using the Index of the DataFrame names_1881 of older names.
  • Print the shape of the new common_names DataFrame. This has been done for you. It should be the same as that of names_1881.
  • Drop the rows of common_names that have null counts using the .dropna() method. These rows correspond to names that fell out of fashion between 1881 & 1981.
  • Print the shape of the reassigned common_names DataFrame. This has been done for you, so hit 'Submit Answer' to see the result!
In [13]:
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'], index_col=(0,1))
names_1981.head()
Out[13]:
count
name gender
Jennifer F 57032
Jessica F 42519
Amanda F 34370
Sarah F 28162
Melissa F 28003
In [14]:
names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name','gender','count'], index_col=(0,1))
names_1881.head()
Out[14]:
count
name gender
Mary F 6919
Anna F 2698
Emma F 2034
Elizabeth F 1852
Margaret F 1658
In [15]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
common_names.shape
Out[15]:
(1935, 1)
In [16]:
# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
common_names.shape
Out[16]:
(1587, 1)
In [17]:
common_names.head(10)
Out[17]:
count
name gender
Mary F 11030.0
Anna F 5182.0
Emma F 532.0
Elizabeth F 20168.0
Margaret F 2791.0
Minnie F 56.0
Ida F 206.0
Annie F 973.0
Bertha F 209.0
Alice F 745.0
In [18]:
del weather1, weather2, weather3, weather4, common_names, names_1881, names_1981

Arithmetic with Series & DataFrames

Loading weather data

import pandas as pd
weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'PrecipitationIn']

Date
2013-07-01 0.18
2013-07-02 0.14
2013-07-03 0.00
2013-07-04 0.25
2013-07-05 0.02
2013-07-06 0.06
2013-07-07 0.10
Name: PrecipitationIn, dtype: float64

Scalar multiplication

weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54

Date
2013-07-01 0.4572
2013-07-02 0.3556
2013-07-03 0.0000
2013-07-04 0.6350
2013-07-05 0.0508
2013-07-06 0.1524
2013-07-07 0.2540
Name: PrecipitationIn, dtype: float64

Absolute temperature range

week1_range = weather.loc['2013-07-01':'2013-07-07', ['Min TemperatureF', 'Max TemperatureF']]
print(week1_range)
Min TemperatureF Max TemperatureF
Date
2013-07-01 66 79
2013-07-02 66 84
2013-07-03 71 86
2013-07-04 70 86
2013-07-05 69 86
2013-07-06 70 89
2013-07-07 70 77

Average temperature

week1_mean = weather.loc['2013-07-01':'2013-07-07', 'Mean TemperatureF']
print(week1_mean)
Date
2013-07-01 72
2013-07-02 74
2013-07-03 78
2013-07-04 77
2013-07-05 76
2013-07-06 78
2013-07-07 72
Name: Mean TemperatureF, dtype: int64

Relative temperature range

week1_range / week1_mean
RuntimeWarning: Cannot compare type 'Timestamp' with type 'str', sort order is
undefined for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)

2013-07-01 00:00:00 2013-07-02 00:00:00 2013-07-03 00:00:00 \
Date
2013-07-01 NaN NaN NaN
2013-07-02 NaN NaN NaN
2013-07-03 NaN NaN NaN
2013-07-04 NaN NaN NaN
2013-07-05 NaN NaN NaN
2013-07-06 NaN NaN NaN
2013-07-07 NaN NaN NaN
2013-07-04 00:00:00 2013-07-05 00:00:00 2013-07-06 00:00:00 \
Date
2013-07-01 NaN NaN NaN
... ...

Relative temperature range

week1_range.divide(week1_mean, axis='rows')

Min TemperatureF Max TemperatureF
Date
2013-07-01 0.916667 1.097222
2013-07-02 0.891892 1.135135
2013-07-03 0.910256 1.102564
2013-07-04 0.909091 1.116883
2013-07-05 0.907895 1.131579
2013-07-06 0.897436 1.141026
2013-07-07 0.972222 1.069444

Percentage changes

week1_mean.pct_change() * 100

Date
2013-07-01 NaN
2013-07-02 2.777778
2013-07-03 5.405405
2013-07-04 -1.282051
2013-07-05 -1.298701
2013-07-06 2.631579
2013-07-07 -7.692308
Name: Mean TemperatureF, dtype: float64

Bronze Olympic medals

bronze = pd.read_csv('bronze_top5.csv', index_col=0)
print(bronze)
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0

Silver Olympic medals

silver = pd.read_csv('silver_top5.csv', index_col=0)
print(silver)
Total
Country
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0

Gold Olympic medals

gold = pd.read_csv('gold_top5.csv', index_col=0)
print(gold)
Total
Country
United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0

Adding bronze, silver

bronze + silver

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64

Adding bronze, silver

bronze + silver

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64
In [22]: print(bronze['United States'])
1052.0
In [23]: print(silver['United States'])
1195.0

Using the .add() method

bronze.add(silver)

Country
France 936.0
Germany NaN
Italy NaN
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64

Using a fill_value

bronze.add(silver, fill_value=0)

Country
France 936.0
Germany 454.0
Italy 394.0
Soviet Union 1211.0
United Kingdom 1096.0
United States 2247.0
Name: Total, dtype: float64

Adding bronze, silver, gold

bronze + silver + gold

Country
France NaN
Germany NaN
Italy NaN
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64

Chaining .add()

bronze.add(silver, fill_value=0).add(gold, fill_value=0)

Country
France 936.0
Germany 861.0
Italy 854.0
Soviet Union 2049.0
United Kingdom 1594.0
United States 4335.0
Name: Total, dtype: float64

Exercises

Adding unaligned DataFrames

The DataFrames january and february, which have been printed in the IPython Shell, represent the sales a company made in the corresponding months.

The Indexes in both DataFrames are called Company, identifying which company bought that quantity of units. The column Units is the number of units sold.

If you were to add these two DataFrames by executing the command total = january + february, how many rows would the resulting DataFrame have? Try this in the IPython Shell and find out for yourself.

In [19]:
jan_dict = {'Company': ['Acme Corporation', 'Hooli', 'Initech', 'Mediacore', 'Streeplex'],
            'Units': [19, 17, 20, 10, 13]}
feb_dict = {'Company': ['Acme Corporation', 'Hooli', 'Mediacore', 'Vandelay Inc'],
            'Units': [15, 3, 12, 25]}

january = pd.DataFrame.from_dict(jan_dict)
january.set_index('Company', inplace=True)
print(january)

february = pd.DataFrame.from_dict(feb_dict)
february.set_index('Company', inplace=True)
print('\n', february, '\n')

print(january + february)
                  Units
Company                
Acme Corporation     19
Hooli                17
Initech              20
Mediacore            10
Streeplex            13

                   Units
Company                
Acme Corporation     15
Hooli                 3
Mediacore            12
Vandelay Inc         25 

                  Units
Company                
Acme Corporation   34.0
Hooli              20.0
Initech             NaN
Mediacore          22.0
Streeplex           NaN
Vandelay Inc        NaN

Broadcasting in Arithmetic formulas

In this exercise, you'll work with weather data pulled from wunderground.com. The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.

Instructions

  • Create a new DataFrame temps_f by extracting the columns 'Min TemperatureF', 'Mean TemperatureF', & 'Max TemperatureF' from weather as a new DataFrame temps_f. To do this, pass the relevant columns as a list to weather[].
  • Create a new DataFrame temps_c from temps_f using the formula (temps_f - 32) * 5/9.
  • Rename the columns of temps_c to replace 'F' with 'C' using the .str.replace('F', 'C') method on temps_c.columns.
  • Print the first 5 rows of DataFrame temps_c. This has been done for you, so hit 'Submit Answer' to see the result!
In [20]:
weather = pd.read_csv(pitts_file)
weather.set_index('Date', inplace=True)
weather.head(3)
Out[20]:
Max TemperatureF Mean TemperatureF Min TemperatureF Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity Min Humidity Max Sea Level PressureIn Mean Sea Level PressureIn Min Sea Level PressureIn Max VisibilityMiles Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees
Date
2013-1-1 32 28 21 30 27 16 100 89 77 30.10 30.01 29.94 10 6 2 10 8 NaN 0.0 8 Snow 277
2013-1-2 25 21 17 14 12 10 77 67 55 30.27 30.18 30.08 10 10 10 14 5 NaN 0.0 4 NaN 272
2013-1-3 32 24 16 19 15 9 77 67 56 30.25 30.21 30.16 10 10 10 17 8 26.0 0.0 3 NaN 229
In [21]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

# Print first 5 rows of temps_c
temps_c.head()
Out[21]:
Min TemperatureC Mean TemperatureC Max TemperatureC
Date
2013-1-1 -6.111111 -2.222222 0.000000
2013-1-2 -8.333333 -6.111111 -3.888889
2013-1-3 -8.888889 -4.444444 0.000000
2013-1-4 -2.777778 -2.222222 -1.111111
2013-1-5 -3.888889 -1.111111 1.111111

Computing percentage growth of GDP

Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.

The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from pandas Foundations.

Instructions

  • Read the file 'GDP.csv' into a DataFrame called gdp.
  • Use parse_dates=True and index_col='DATE'.
  • Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
  • Print the last 8 rows of the slice post2008. This has been done for you. This data has quarterly frequency so the indices are separated by three-month intervals.
  • Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you need to chain .resample() (using the alias 'A' for annual frequency) with some kind of aggregation; you will use the aggregation method .last() to select the last element when resampling.
  • Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.
In [22]:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv(gdp_usa_file, parse_dates=True, index_col='DATE')
gdp.head()
Out[22]:
VALUE
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
1947-10-01 260.3
1948-01-01 266.2
In [23]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':]

# Print the last 8 rows of post2008
post2008.tail(8)
Out[23]:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
In [24]:
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()

# Print yearly
yearly
Out[24]:
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
In [25]:
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100

# Print yearly again
yearly
Out[25]:
VALUE growth
DATE
2008-12-31 14549.9 NaN
2009-12-31 14566.5 0.114090
2010-12-31 15230.2 4.556345
2011-12-31 15785.3 3.644732
2012-12-31 16297.3 3.243524
2013-12-31 16999.9 4.311144
2014-12-31 17692.2 4.072377
2015-12-31 18222.8 2.999062
2016-12-31 18436.5 1.172707

Converting currency of stocks

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv for sp500 and exchange.csv for the exchange rates are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

Instructions

  • Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively..
  • Use parse_dates=True and index_col='Date'.
  • Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and print the first 5 rows.
  • Construct a new DataFrame pounds by converting US dollars to British pounds. You'll use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'
  • Print the first 5 rows of the new DataFrame pounds. This has been done for you, so hit 'Submit Answer' to see the results!.
In [26]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv(sp500_file, parse_dates=True, index_col='Date')
sp500.head()
Out[26]:
Open High Low Close Volume Adj Close
Date
2015-01-02 2058.899902 2072.360107 2046.040039 2058.199951 2708700000 2058.199951
2015-01-05 2054.439941 2054.439941 2017.339966 2020.579956 3799120000 2020.579956
2015-01-06 2022.150024 2030.250000 1992.439941 2002.609985 4460110000 2002.609985
2015-01-07 2005.550049 2029.609985 2005.550049 2025.900024 3805480000 2025.900024
2015-01-08 2030.609985 2064.080078 2030.609985 2062.139893 3934010000 2062.139893
In [27]:
# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv(exch_rates_file, parse_dates=True, index_col='Date')
exchange.head()
Out[27]:
GBP/USD
Date
2015-01-02 0.65101
2015-01-05 0.65644
2015-01-06 0.65896
2015-01-07 0.66344
2015-01-08 0.66151
In [28]:
# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
dollars.head()
Out[28]:
Open Close
Date
2015-01-02 2058.899902 2058.199951
2015-01-05 2054.439941 2020.579956
2015-01-06 2022.150024 2002.609985
2015-01-07 2005.550049 2025.900024
2015-01-08 2030.609985 2062.139893
In [29]:
# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')

# Print the head of pounds
pounds.head()
Out[29]:
Open Close
Date
2015-01-02 1340.364425 1339.908750
2015-01-05 1348.616555 1326.389506
2015-01-06 1332.515980 1319.639876
2015-01-07 1330.562125 1344.063112
2015-01-08 1343.268811 1364.126161
In [30]:
del january, february, feb_dict, jan_dict, weather, temps_f, temps_c, gdp, post2008, yearly, sp500, exchange, dollars, pounds

Concatenating Data

Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

Appending & concatenating Series

append()

  • .append(): Series & DataFrame method
  • Invocation:
  • s1.append(s2)
  • Stacks rows of s2 below s1
  • Method for Series & DataFrames

concat()

  • concat(): pandas module function
  • Invocation:
  • pd.concat([s1, s2, s3])
  • Can stack row-wise or column-wise

concat() & .append()

  • Equivalence of concat() & .append():
  • result1 = pd.concat([s1, s2, s3])
  • result2 = s1.append(s2).append(s3)
  • result1 == result2 elementwise

Series of US states

import pandas as pd
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND', 'SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT',

Using .append()

east = northeast.append(south)
print(east)
0 CT       7 DC
1 ME       8 WV
2 MA       9 AL
3 NH       10 KY
4 RI       11 MS
5 VT       12 TN
6 NJ       13 AR
7 NY       14 LA
8 PA       15 OK
0 DE       16 TX
1 FL       dtype: object
2 GA
3 MD
4 NC
5 SC
6 VA

The appended Index

print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')

print(east.loc[3])
3 NH
3 MD
dtype: object

Using .reset_index()

new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object

print(new_east.index)
RangeIndex(start=0, stop=26, step=1)

Using concat()

east = pd.concat([northeast, south])
print(east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
0 DE
1 FL
dtype: object
print(east.index)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')

Using ignore_index

new_east = pd.concat([northeast, south], ignore_index=True)
print(new_east.head(11))
0 CT
1 ME
2 MA
3 NH
4 RI
5 VT
6 NJ
7 NY
8 PA
9 DE
10 FL
dtype: object
print(new_east.index)
RangeIndex(start=0, stop=26, step=1)

Exercises

Appending Series with nonunique Indices

The Series bronze and silver, which have been printed in the IPython Shell, represent the 5 countries that won the most bronze and silver Olympic medals respectively between 1896 & 2008. The Indexes of both Series are called Country and the values are the corresponding number of medals won.

If you were to run the command combined = bronze.append(silver), how many rows would combined have? And how many rows would combined.loc['United States'] return? Find out for yourself by running these commands in the IPython Shell.

Instructions

Possible Answers

  • combined has 5 rows and combined.loc['United States'] is empty (0 rows).
  • combined has 10 rows and combined.loc['United States'] has 2 rows.
  • combined has 6 rows and combined.loc['United States'] has 1 row.
  • combined has 5 rows and combined.loc['United States'] has 2 rows.
In [31]:
bronze = pd.read_csv(so_bronze5_file, index_col=0)
bronze
Out[31]:
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
In [32]:
silver = pd.read_csv(so_silver5_file, index_col=0)
silver
Out[32]:
Total
Country
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
In [33]:
combined = bronze.append(silver)
combined
Out[33]:
Total
Country
United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
In [34]:
combined.loc['United States']
Out[34]:
Total
Country
United States 1052.0
United States 1195.0

Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units' column from each and append them together with method chaining using .append().

To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.

Instructions

  • Read the files 'sales-jan-2015.csv', 'sales-feb-2015.csv' and 'sales-mar-2015.csv' into the DataFrames jan, feb, and mar respectively.
  • Use parse_dates=True and index_col='Date'.
  • Extract the 'Units' column of jan, feb, and mar to create the Series jan_units, feb_units, and mar_units respectively.
  • Construct the Series quarter1 by appending feb_units to jan_units and then appending mar_units to the result. Use chained calls to the .append() method to do this.
  • Verify that quarter1 has the individual Series stacked vertically. To do this:
  • Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
  • Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
  • Compute and print the total number of units sold from the Series quarter1. This has been done for you, so hit 'Submit Answer' to see the result!
In [35]:
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv(sales_jan_2015_file, parse_dates=True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv(sales_feb_2015_file, parse_dates=True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv(sales_mar_2015_file, parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())
Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
642

Concatenating pandas Series along row axis

Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan, feb, andmar have been pre-loaded.

Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining calls to .append().

You may be wondering about the difference between pd.concat() and pandas' .append() method. One way to think of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more flexibility, as you'll see in later exercises.

Instructions

  • Create an empty list called units. This has been done for you.
    • Use a for loop to iterate over [jan, feb, mar]:
  • In each iteration of the loop, append the 'Units' column of each DataFrame to units.
    • Concatenate the Series contained in the list units into a longer Series called quarter1 using pd.concat().
  • Specify the keyword argument axis='rows' to stack the Series vertically.
  • Verify that quarter1 has the individual Series stacked vertically by printing slices. This has been done for you, so hit 'Submit Answer' to see the result!
In [36]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month.Units)

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
In [37]:
del bronze, silver, combined, jan, feb, mar, jan_units, feb_units, mar_units, quarter1

Appending & concatenating DataFrames

Loading population data

In [1]: import pandas as pd
In [2]: pop1 = pd.read_csv('population_01.csv', index_col=0)
In [3]: pop2 = pd.read_csv('population_02.csv', index_col=0)
In [38]:
pop1_data = {'Zip Code ZCTA': [66407, 72732, 50579, 46421], '2010 Census Population': [479, 4716, 2405, 30670]}
pop2_data = {'Zip Code ZCTA': [12776, 76092, 98360, 49464], '2010 Census Population': [2180, 26669, 12221, 27481]}

pop1 = pd.DataFrame.from_dict(pop1_data)
pop1.set_index('Zip Code ZCTA', drop=True, inplace=True)
pop2 = pd.DataFrame.from_dict(pop2_data)
pop2.set_index('Zip Code ZCTA', drop=True, inplace=True)

Examining population data

In [39]:
pop1
Out[39]:
2010 Census Population
Zip Code ZCTA
66407 479
72732 4716
50579 2405
46421 30670
In [40]:
pop2
Out[40]:
2010 Census Population
Zip Code ZCTA
12776 2180
76092 26669
98360 12221
49464 27481
In [41]:
print(type(pop1), pop1.shape)
print(type(pop2), pop2.shape)
<class 'pandas.core.frame.DataFrame'> (4, 1)
<class 'pandas.core.frame.DataFrame'> (4, 1)

Appending population DataFrames

In [42]:
pop1.append(pop2)
Out[42]:
2010 Census Population
Zip Code ZCTA
66407 479
72732 4716
50579 2405
46421 30670
12776 2180
76092 26669
98360 12221
49464 27481
In [43]:
print(pop1.index.name, pop1.columns)
print(pop2.index.name, pop2.columns)
Zip Code ZCTA Index(['2010 Census Population'], dtype='object')
Zip Code ZCTA Index(['2010 Census Population'], dtype='object')

Population & unemployment data

population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
In [44]:
pop_data = {'Zip Code ZCTA': [57538, 59916, 37660, 2860], '2010 Census Population': [322, 130, 40038, 45199]}
emp_data = {'Zip': [2860, 46167, 1097, 80808], 'unemployment': [0.11, 0.02, 0.33, 0.07], 'participants': [34447, 4800, 42, 4310]}

population = pd.DataFrame.from_dict(pop_data)
population.set_index('Zip Code ZCTA', drop=True, inplace=True)
unemployment = pd.DataFrame.from_dict(emp_data)
unemployment.set_index('Zip', drop=True, inplace=True)
In [45]:
population
Out[45]:
2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199
In [46]:
unemployment
Out[46]:
unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 42
80808 0.07 4310

Appending population & unemployment

In [47]:
population.append(unemployment, sort=True)
Out[47]:
2010 Census Population participants unemployment
57538 322.0 NaN NaN
59916 130.0 NaN NaN
37660 40038.0 NaN NaN
2860 45199.0 NaN NaN
2860 NaN 34447.0 0.11
46167 NaN 4800.0 0.02
1097 NaN 42.0 0.33
80808 NaN 4310.0 0.07

Repeated index labels

In [48]:
population.append(unemployment, sort=True)
Out[48]:
2010 Census Population participants unemployment
57538 322.0 NaN NaN
59916 130.0 NaN NaN
37660 40038.0 NaN NaN
2860 45199.0 NaN NaN
2860 NaN 34447.0 0.11
46167 NaN 4800.0 0.02
1097 NaN 42.0 0.33
80808 NaN 4310.0 0.07

Concatenating rows

  • with axis=0, pd.concat is the same as population.append(unemployment, sort=True)
In [49]:
pd.concat([population, unemployment], axis=0, sort=True)
Out[49]:
2010 Census Population participants unemployment
57538 322.0 NaN NaN
59916 130.0 NaN NaN
37660 40038.0 NaN NaN
2860 45199.0 NaN NaN
2860 NaN 34447.0 0.11
46167 NaN 4800.0 0.02
1097 NaN 42.0 0.33
80808 NaN 4310.0 0.07

Concatenating column

  • outer join
In [50]:
pd.concat([population, unemployment], axis=1, sort=True)
Out[50]:
2010 Census Population unemployment participants
1097 NaN 0.33 42.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
80808 NaN 0.07 4310.0
In [51]:
del pop1_data, pop2_data, pop1, pop2, pop_data, emp_data, population, unemployment

Exercises

Appending DataFrames with ignore_index

In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original two DataFrames, you'll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, you'll specify ignore_index=True so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

Instructions

  • Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
  • Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
  • Print the shapes of all three DataFrames. This has been done for you.
  • Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate filter. The relevant column of combined_names here is 'name'.
In [52]:
names_1881 = pd.read_csv(baby_1881_file, header=None, names=['name', 'gender', 'count'])
names_1981 = pd.read_csv(baby_1981_file, header=None, names=['name', 'gender', 'count'])
In [53]:
names_1981.head()
Out[53]:
name gender count
0 Jennifer F 57032
1 Jessica F 42519
2 Amanda F 34370
3 Sarah F 28162
4 Melissa F 28003
In [54]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True, sort=False)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
combined_names[combined_names.name  == 'Morgan']
(19455, 4)
(1935, 4)
(21390, 4)
Out[54]:
name gender count year
1283 Morgan M 23 1881
2096 Morgan F 1769 1981
14390 Morgan M 766 1981

Concatenating pandas DataFrames along column axis

The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 oraxis='columns'.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).

The files 'quarterly_max_temp.csv' and 'monthly_mean_temp.csv' have been pre-loaded into the DataFrames weather_max and weather_mean respectively, and pandas has been imported as pd.

Instructions

  • Create a new DataFrame called weather by concatenating the DataFrames weather_max and weather_mean horizontally.
    • Pass the DataFrames to pd.concat() as a list and specify the keyword argument axis=1 to stack them horizontally.
  • Print the new DataFrame weather.
In [55]:
weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
                     'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}

weather_mean = pd.DataFrame.from_dict(weather_mean_data)
weather_mean.set_index('Month', inplace=True, drop=True)
weather_max = pd.DataFrame.from_dict(weather_max_data)
weather_max.set_index('Month', inplace=True, drop=True)
In [56]:
weather_max
Out[56]:
Max TemperatureF
Month
Jan 68
Apr 89
Jul 91
Oct 84
In [57]:
weather_mean
Out[57]:
Mean TemperatureF
Month
Apr 53.100000
Aug 70.000000
Dec 34.935484
Feb 28.714286
Jan 32.354839
Jul 72.870968
Jun 70.133333
Mar 35.000000
May 62.612903
Nov 39.800000
Oct 55.451613
Sep 63.766667
In [58]:
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1, sort=True)

# Print weather
weather
Out[58]:
Max TemperatureF Mean TemperatureF
Apr 89.0 53.100000
Aug NaN 70.000000
Dec NaN 34.935484
Feb NaN 28.714286
Jan 68.0 32.354839
Jul 91.0 72.870968
Jun NaN 70.133333
Mar NaN 35.000000
May NaN 62.612903
Nov NaN 39.800000
Oct 84.0 55.451613
Sep NaN 63.766667

Reading multiple files to build a DataFrame

It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.

pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

Instructions

  • Iterate over medal_types in the for loop.
  • Inside the for loop:
    • Create file_name using string interpolation with the loop variable medal. This has been done for you. The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
    • Create the list of column names called columns. This has been done for you.
    • Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country', and names=columns to get the correct row and column Indexes.
    • Append medal_df to medals using the list .append() method.
  • Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame called medals. Print it in its entirety.
In [59]:
top_five = data.glob('*_top5.csv')
for file in top_five:
    print(file)
c:\Repositories\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_bronze_top5.csv
c:\Repositories\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_gold_top5.csv
c:\Repositories\DataCamp\data\merging-dataframes-with-pandas\summer_olympics_silver_top5.csv
In [60]:
medal_types = ['bronze', 'silver', 'gold']
medal_list = list()

for medal in medal_types:

    # Create the file name: file_name
    file_name = data / f'summer_olympics_{medal}_top5.csv'
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medal_list.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medal_list, axis='columns', sort=True)

# Print medals
medals
Out[60]:
bronze silver gold
France 475.0 461.0 NaN
Germany 454.0 NaN 407.0
Italy NaN 394.0 460.0
Soviet Union 584.0 627.0 838.0
United Kingdom 505.0 591.0 498.0
United States 1052.0 1195.0 2088.0
In [61]:
del names_1881, names_1981, combined_names, weather_mean_data, weather_max_data, weather_mean, weather_max, weather, top_five, medals, medal_list

Concatenation, keys & MultiIndexes

Loading rainfall data

import pandas as pd
file1 = 'q1_rainfall_2013.csv'
rain2013 = pd.read_csv(file1, index_col='Month', parse_dates=True)
file2 = 'q1_rainfall_2014.csv'
rain2014 = pd.read_csv(file2, index_col='Month', parse_dates=True)
In [62]:
rain_2013_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.096129, 0.067143, 0.061613]}
rain_2014_data = {'Month': ['Jan', 'Feb', 'Mar'], 'Precipitation': [0.050323, 0.082143, 0.070968]}

rain2013 = pd.DataFrame.from_dict(rain_2013_data)
rain2013.set_index('Month', inplace=True)
rain2014 = pd.DataFrame.from_dict(rain_2014_data)
rain2014.set_index('Month', inplace=True)

Examining rainfall data

In [63]:
rain2013
Out[63]:
Precipitation
Month
Jan 0.096129
Feb 0.067143
Mar 0.061613
In [64]:
rain2014
Out[64]:
Precipitation
Month
Jan 0.050323
Feb 0.082143
Mar 0.070968

Concatenating rows

In [65]:
pd.concat([rain2013, rain2014], axis=0)
Out[65]:
Precipitation
Month
Jan 0.096129
Feb 0.067143
Mar 0.061613
Jan 0.050323
Feb 0.082143
Mar 0.070968

Using multi-index on rows

In [66]:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
rain1314
Out[66]:
Precipitation
Month
2013 Jan 0.096129
Feb 0.067143
Mar 0.061613
2014 Jan 0.050323
Feb 0.082143
Mar 0.070968

Accessing a multi-index

In [67]:
rain1314.loc[2014]
Out[67]:
Precipitation
Month
Jan 0.050323
Feb 0.082143
Mar 0.070968

Concatenating columns

In [68]:
rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314
Out[68]:
Precipitation Precipitation
Month
Jan 0.096129 0.050323
Feb 0.067143 0.082143
Mar 0.061613 0.070968

Using a multi-index on columns

In [69]:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
rain1314
Out[69]:
2013 2014
Precipitation Precipitation
Month
Jan 0.096129 0.050323
Feb 0.067143 0.082143
Mar 0.061613 0.070968
In [70]:
rain1314[2013]
Out[70]:
Precipitation
Month
Jan 0.096129
Feb 0.067143
Mar 0.061613

pd.concat() with dict

In [71]:
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis='columns')
rain1314
Out[71]:
2013 2014
Precipitation Precipitation
Month
Jan 0.096129 0.050323
Feb 0.067143 0.082143
Mar 0.061613 0.070968
In [72]:
del rain_2013_data, rain_2014_data, rain2013, rain2014, rain1314

Exercises

Concatenating vertically to get MultiIndexed rows

When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys parameter in the call to pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index label. So you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

Instructions

  • Within the for loop:
    • Read file_name into a DataFrame called medal_df. Specify the index to be 'Country'.
    • Append medal_df to medals.
  • Concatenate the list of DataFrames medals into a single DataFrame called medals. Be sure to use the keyword argument keys=['bronze', 'silver', 'gold'] to create a vertically stacked DataFrame with a MultiIndex.
  • Print the new DataFrame medals. This has been done for you, so hit 'Submit Answer' to see the result!
In [73]:
medal_types = ['bronze', 'silver', 'gold']
medal_list = list()

for medal in medal_types:

    # Create the file name: file_name
    file_name = data / f'summer_olympics_{medal}_top5.csv'
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medal_list.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)
                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
silver United States   1195.0
       Soviet Union     627.0
       United Kingdom   591.0
       France           461.0
       Italy            394.0
gold   United States   2088.0
       Soviet Union     838.0
       United Kingdom   498.0
       Italy            460.0
       Germany          407.0

Slicing MultiIndexed DataFrames

This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use the pd.IndexSlice to extract specific slices. Check out this exercise from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.

pandas has been imported for you as pd and the DataFrame medals is already in your namespace.

Instructions

  • Create a new DataFrame medals_sorted with the entries of medals sorted. Use .sort_index(level=0) to ensure the Index is sorted suitably.
  • Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
  • Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level of a MultiIndex.
  • Slice all the data on medals won by the United Kingdom. To do this, use the .loc[] accessor with idx[:,'United Kingdom'], :.
In [74]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])
Total    454.0
Name: (bronze, Germany), dtype: float64
In [75]:
# Print data about silver medals
print(medals_sorted.loc['silver'])
                 Total
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0
In [76]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:, 'United Kingdom'], :]
Out[76]:
Total
Country
bronze United Kingdom 505.0
gold United Kingdom 498.0
silver United Kingdom 591.0

Concatenating horizontally to get MultiIndexed columns

It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called dataframes. All three DataFrames contain 'Company', 'Product', and 'Units' columns with a 'Date' column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes Hardware transactions, the second describes Software transactions, and the third, Service transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.

Instructions

  • Construct a new DataFrame february with MultiIndexed columns by concatenating the list dataframes.
  • Use axis=1 to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service'] to construct a hierarchical Index from each DataFrame.
  • Print summary information from the new DataFrame february using the .info() method. This has been done for you.
  • Create an alias called idx for pd.IndexSlice.
  • Extract a slice called slice_2_8 from february (using .loc[] & idx) that comprises rows between Feb. 2, 2015 to Feb. 8, 2015 from columns under 'Company'.
  • Print the slice_2_8. This has been done for you, so hit 'Submit Answer' to see the sliced data!
In [77]:
hw = pd.read_csv(sales_feb_hardware_file, index_col='Date')
sw = pd.read_csv(sales_feb_software_file, index_col='Date')
sv = pd.read_csv(sales_feb_service_file, index_col='Date')

dataframes = [hw, sw, sv]
dataframes
Out[77]:
[                             Company   Product  Units
 Date                                                 
 2015-02-04 21:52:45  Acme Coporation  Hardware     14
 2015-02-07 22:58:10  Acme Coporation  Hardware      1
 2015-02-19 10:59:33        Mediacore  Hardware     16
 2015-02-02 20:54:49        Mediacore  Hardware      9
 2015-02-21 20:41:47            Hooli  Hardware      3,
                              Company   Product  Units
 Date                                                 
 2015-02-16 12:09:19            Hooli  Software     10
 2015-02-03 14:14:18          Initech  Software     13
 2015-02-02 08:33:01            Hooli  Software      3
 2015-02-05 01:53:06  Acme Coporation  Software     19
 2015-02-11 20:03:08          Initech  Software      7
 2015-02-09 13:09:55        Mediacore  Software      7
 2015-02-11 22:50:44            Hooli  Software      4
 2015-02-04 15:36:29        Streeplex  Software     13
 2015-02-21 05:01:26        Mediacore  Software      3,
                        Company  Product  Units
 Date                                          
 2015-02-26 08:57:45  Streeplex  Service      4
 2015-02-25 00:29:00    Initech  Service     10
 2015-02-09 08:57:30  Streeplex  Service     19
 2015-02-26 08:58:51  Streeplex  Service      1
 2015-02-05 22:05:03      Hooli  Service     10
 2015-02-19 16:02:58  Mediacore  Service     10]
In [78]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'], sort=True)

# Print february.info()
february.info()
<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
(Hardware, Company)    5 non-null object
(Hardware, Product)    5 non-null object
(Hardware, Units)      5 non-null float64
(Software, Company)    9 non-null object
(Software, Product)    9 non-null object
(Software, Units)      9 non-null float64
(Service, Company)     6 non-null object
(Service, Product)     6 non-null object
(Service, Units)       6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.6+ KB
In [79]:
february
Out[79]:
Hardware Software Service
Company Product Units Company Product Units Company Product Units
2015-02-02 08:33:01 NaN NaN NaN Hooli Software 3.0 NaN NaN NaN
2015-02-02 20:54:49 Mediacore Hardware 9.0 NaN NaN NaN NaN NaN NaN
2015-02-03 14:14:18 NaN NaN NaN Initech Software 13.0 NaN NaN NaN
2015-02-04 15:36:29 NaN NaN NaN Streeplex Software 13.0 NaN NaN NaN
2015-02-04 21:52:45 Acme Coporation Hardware 14.0 NaN NaN NaN NaN NaN NaN
2015-02-05 01:53:06 NaN NaN NaN Acme Coporation Software 19.0 NaN NaN NaN
2015-02-05 22:05:03 NaN NaN NaN NaN NaN NaN Hooli Service 10.0
2015-02-07 22:58:10 Acme Coporation Hardware 1.0 NaN NaN NaN NaN NaN NaN
2015-02-09 08:57:30 NaN NaN NaN NaN NaN NaN Streeplex Service 19.0
2015-02-09 13:09:55 NaN NaN NaN Mediacore Software 7.0 NaN NaN NaN
2015-02-11 20:03:08 NaN NaN NaN Initech Software 7.0 NaN NaN NaN
2015-02-11 22:50:44 NaN NaN NaN Hooli Software 4.0 NaN NaN NaN
2015-02-16 12:09:19 NaN NaN NaN Hooli Software 10.0 NaN NaN NaN
2015-02-19 10:59:33 Mediacore Hardware 16.0 NaN NaN NaN NaN NaN NaN
2015-02-19 16:02:58 NaN NaN NaN NaN NaN NaN Mediacore Service 10.0
2015-02-21 05:01:26 NaN NaN NaN Mediacore Software 3.0 NaN NaN NaN
2015-02-21 20:41:47 Hooli Hardware 3.0 NaN NaN NaN NaN NaN NaN
2015-02-25 00:29:00 NaN NaN NaN NaN NaN NaN Initech Service 10.0
2015-02-26 08:57:45 NaN NaN NaN NaN NaN NaN Streeplex Service 4.0
2015-02-26 08:58:51 NaN NaN NaN NaN NaN NaN Streeplex Service 1.0
In [80]:
# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]

# Print slice_2_8
slice_2_8
Out[80]:
Hardware Software Service
Company Company Company
2015-02-02 08:33:01 NaN Hooli NaN
2015-02-02 20:54:49 Mediacore NaN NaN
2015-02-03 14:14:18 NaN Initech NaN
2015-02-04 15:36:29 NaN Streeplex NaN
2015-02-04 21:52:45 Acme Coporation NaN NaN
2015-02-05 01:53:06 NaN Acme Coporation NaN
2015-02-05 22:05:03 NaN NaN Hooli
2015-02-07 22:58:10 Acme Coporation NaN NaN

Concatenating DataFrames from a dict

You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan, feb, and mar have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company' column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.

Instructions

  • Create a list called month_list consisting of the tuples ('january', jan), ('february', feb), and ('march', mar).
  • Create an empty dictionary called month_dict.
  • Inside the for loop:
    • Group month_data by 'Company' and use .sum() to aggregate.
  • Construct a new DataFrame called sales by concatenating the DataFrames stored in month_dict.
  • Create an alias for pd.IndexSlice and print all sales by 'Mediacore'. This has been done for you, so hit 'Submit Answer' to see the result!
In [81]:
jan = pd.read_csv(sales_jan_2015_file)
feb = pd.read_csv(sales_feb_2015_file)
mar = pd.read_csv(sales_mar_2015_file)
In [82]:
mar
Out[82]:
Date Company Product Units
0 2015-03-22 14:42:25 Mediacore Software 6
1 2015-03-12 18:33:06 Initech Service 19
2 2015-03-22 03:58:28 Streeplex Software 8
3 2015-03-15 00:53:12 Hooli Hardware 19
4 2015-03-17 19:25:37 Hooli Hardware 10
5 2015-03-16 05:54:06 Mediacore Software 3
6 2015-03-25 10:18:10 Initech Hardware 9
7 2015-03-25 16:42:42 Streeplex Hardware 12
8 2015-03-26 05:20:04 Streeplex Software 3
9 2015-03-06 10:11:45 Mediacore Software 17
10 2015-03-22 21:14:39 Initech Hardware 11
11 2015-03-17 19:38:12 Hooli Hardware 8
12 2015-03-28 19:20:38 Acme Coporation Service 5
13 2015-03-13 04:41:32 Streeplex Hardware 8
14 2015-03-06 02:03:56 Mediacore Software 17
15 2015-03-13 11:40:16 Initech Software 11
16 2015-03-27 08:29:45 Mediacore Software 6
17 2015-03-21 06:42:41 Mediacore Hardware 19
18 2015-03-15 08:50:45 Initech Hardware 18
19 2015-03-13 16:25:24 Streeplex Software 9
In [83]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = dict()

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby(['Company']).sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])
                          Units
         Company               
february Acme Coporation     34
         Hooli               30
         Initech             30
         Mediacore           45
         Streeplex           37
january  Acme Coporation     76
         Hooli               70
         Initech             37
         Mediacore           15
         Streeplex           50
march    Acme Coporation      5
         Hooli               37
         Initech             68
         Mediacore           68
         Streeplex           40
                    Units
         Company         
february Mediacore     45
january  Mediacore     15
march    Mediacore     68
In [84]:
del medal_types, medal_list, medal_df, medals, medals_sorted, idx, hw, sw, sv, dataframes, february, slice_2_8

Outer & inner joins

Using with arrays

In [85]:
A = np.arange(8).reshape(2, 4) + 0.1
A
Out[85]:
array([[0.1, 1.1, 2.1, 3.1],
       [4.1, 5.1, 6.1, 7.1]])
In [86]:
B = np.arange(6).reshape(2,3) + 0.2
B
Out[86]:
array([[0.2, 1.2, 2.2],
       [3.2, 4.2, 5.2]])
In [87]:
C = np.arange(12).reshape(3,4) + 0.3
C
Out[87]:
array([[ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

Stacking arrays horizontally

In [88]:
np.hstack([B, A])
Out[88]:
array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])
In [89]:
np.concatenate([B, A], axis=1)
Out[89]:
array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])

Stacking arrays vertically

In [90]:
np.vstack([A, C])
Out[90]:
array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])
In [91]:
np.concatenate([A, C], axis=0)
Out[91]:
array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

Incompatible array dimensions

In [92]:
np.concatenate([A, B], axis=0) # incompatible columns
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-92-b2e676149fb0> in <module>
----> 1 np.concatenate([A, B], axis=0) # incompatible columns

ValueError: all the input array dimensions except for the concatenation axis must match exactly
In [93]:
np.concatenate([A, C], axis=1) # incompatible rows
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-4817b1d23978> in <module>
----> 1 np.concatenate([A, C], axis=1) # incompatible rows

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Population & unemployment data

population = pd.read_csv('population_00.csv', index_col=0)

unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
print(population)
2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199

print(unemployment)
unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 42
80808 0.07 4310

Converting to arrays

population_array = np.array(population)
print(population_array) # Index info is lost
[[ 322]
[ 130]
[40038]
[45199]]

unemployment_array = np.array(unemployment)
print(population_array)
[[ 1.10000000e-01 3.44470000e+04]
[ 2.00000000e-02 4.80000000e+03]
[ 3.30000000e-01 4.20000000e+01]
[ 7.00000000e-02 4.31000000e+03]]

Manipulating data as arrays

print(np.concatenate([population_array, unemployment_array], axis=1))
[[ 3.22000000e+02 1.10000000e-01 3.44470000e+04]
[ 1.30000000e+02 2.00000000e-02 4.80000000e+03]
[ 4.00380000e+04 3.30000000e-01 4.20000000e+01]
[ 4.51990000e+04 7.00000000e-02 4.31000000e+03]]

Joins

  • Joining tables: Combining rows of multiple tables
  • Outer join
    • Union of index sets (all labels, no repetition)
    • Missing fields filled with NaN
    • Preserves the indices in the original tables, filling null values for missing rows
    • Has all the indices of the original tables without repetiton (like a set union)
  • Inner join
    • Intersection of index sets (only common labels)
    • Has only labels common to both tables (like a set intersection)

Concatenation & inner join

  • only the row label present in both DataFrames is preserved
pd.concat([population, unemployment], axis=1, join='inner')

2010 Census Population unemployment participants
2860 45199 0.11 34447

Concatenation & outer join

  • All row indiecs from the original two indexes exist in the joind DataFrame index.
  • When a row occurs in one DataFrame, but not in the other, the missing column entries are filled with null values
pd.concat([population, unemployment], axis=1, join='outer')

2010 Census Population unemployment participants
1097 NaN 0.33 42.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
80808 NaN 0.07 4310.0

Inner join on other axis

  • The resulting DataFrame is empty becasue no column index label appears in both population and unemployment
pd.concat([population, unemployment], join='inner', axis=0)

Empty DataFrame
Columns: []
Index: [2860, 46167, 1097, 80808, 57538, 59916, 37660, 2860]

Exercises

Concatenating DataFrames with inner join

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.

The DataFrames bronze, silver, and gold have been pre-loaded for you.

Your task is to compute an inner join.

Instructions

  • Construct a list of DataFrames called medal_list with entries bronze, silver, and gold.
  • Concatenate medal_list horizontally with an inner join to create medals.
    • Use the keyword argument keys=['bronze', 'silver', 'gold'] to yield suitable hierarchical indexing.
    • Use axis=1 to get horizontal concatenation.
    • Use join='inner' to keep only rows that share common index labels.
  • Print the new DataFrame medals.
In [94]:
bronze = pd.read_csv(so_bronze5_file)
silver = pd.read_csv(so_silver_file)
gold = pd.read_csv(so_gold_file)
In [95]:
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
medals
Out[95]:
bronze silver gold
Country Total NOC Country Total NOC Country Total
0 United States 1052.0 USA United States 1195.0 USA United States 2088.0
1 Soviet Union 584.0 URS Soviet Union 627.0 URS Soviet Union 838.0
2 United Kingdom 505.0 GBR United Kingdom 591.0 GBR United Kingdom 498.0
3 France 475.0 FRA France 461.0 FRA France 378.0
4 Germany 454.0 GER Germany 350.0 GER Germany 407.0

Resampling & concatenating DataFrames with inner join

In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.

You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate offset alias for resampling, and the method .resample() must be chained with some kind of aggregation method (.pct_change() and .last() in this case).

pandas has been imported as pd, and the DataFrames china and us have been pre-loaded, with the output of china.head() and us.head() printed in the IPython Shell.

Instructions

  • Make a new DataFrame china_annual by resampling the DataFrame china with .resample('A').last() (i.e., with annual frequency) and chaining two method calls:
  • Chain .pct_change(10) as an aggregation method to compute the percentage change with an offset of ten years.
  • Chain .dropna() to eliminate rows containing null values.
  • Make a new DataFrame us_annual by resampling the DataFrame us exactly as you resampled china.
  • Concatenate china_annual and us_annual to construct a DataFrame called gdp. Use join='inner' to perform an inner join and use axis=1 to concatenate horizontally.
  • Print the result of resampling gdp every decade (i.e., using .resample('10A')) and aggregating with the method .last(). This has been done for you, so hit 'Submit Answer' to see the result!
In [96]:
china = pd.read_csv(gdp_china_file, parse_dates=['Year'])
china.rename(columns={'GDP': 'China'}, inplace=True)
china.set_index('Year', inplace=True)

us = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
us.rename(columns={'DATE': 'Year', 'VALUE': 'US'}, inplace=True)
us.set_index('Year', inplace=True)
In [97]:
china.head()
Out[97]:
China
Year
1960-01-01 59.184116
1961-01-01 49.557050
1962-01-01 46.685179
1963-01-01 50.097303
1964-01-01 59.062255
In [98]:
us.head()
Out[98]:
US
Year
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
1947-10-01 260.3
1948-01-01 266.2
In [99]:
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()
china_annual.head()
Out[99]:
China
Year
1970-12-31 0.546128
1971-12-31 0.988860
1972-12-31 1.402472
1973-12-31 1.730085
1974-12-31 1.408556
In [100]:
# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()
us_annual.head()
Out[100]:
US
Year
1957-12-31 0.827507
1958-12-31 0.782686
1959-12-31 0.953137
1960-12-31 0.689354
1961-12-31 0.630959
In [101]:
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)

# Resample gdp and print
gdp.resample('10A').last()
Out[101]:
China US
Year
1970-12-31 0.546128 1.017187
1980-12-31 1.072537 1.742556
1990-12-31 0.892820 1.012126
2000-12-31 2.357522 0.738632
2010-12-31 4.011081 0.454332
2020-12-31 3.789936 0.361780
In [102]:
del bronze, silver, gold, medal_list, medals, china, us, china_annual, us_annual, gdp

Merging Data

Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

  • merge() extends concat() with the ability to align rows using multiple columns

Merging DataFrames

In [103]:
pa_zipcode_population = {'Zipcode': [16855, 15681, 18657, 17307, 15635],
                         '2010 Census Population': [282, 5241, 11985, 5899, 220]}
pa_zipcode_city = {'Zipcode': [17545,18455, 17307, 15705, 16833, 16220, 18618, 16855, 16623, 15635, 15681, 18657, 15279, 17231, 18821],
                   'City': ['MANHEIM', 'PRESTON PARK', 'BIGLERVILLE', 'INDIANA', 'CURWENSVILLE', 'CROWN', 'HARVEYS LAKE', 'MINERAL SPRINGS',
                            'CASSVILLE', 'HANNASTOWN', 'SALTSBURG', 'TUNKHANNOCK', 'PITTSBURG', 'LEMASTERS', 'GREAT BEND'],
                   'State': ['PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA', 'PA']}

Population DataFrame

In [104]:
population = pd.DataFrame.from_dict(pa_zipcode_population)
population
Out[104]:
Zipcode 2010 Census Population
0 16855 282
1 15681 5241
2 18657 11985
3 17307 5899
4 15635 220

Cities DataFrame

In [105]:
cities = pd.DataFrame.from_dict(pa_zipcode_city)
cities
Out[105]:
Zipcode City State
0 17545 MANHEIM PA
1 18455 PRESTON PARK PA
2 17307 BIGLERVILLE PA
3 15705 INDIANA PA
4 16833 CURWENSVILLE PA
5 16220 CROWN PA
6 18618 HARVEYS LAKE PA
7 16855 MINERAL SPRINGS PA
8 16623 CASSVILLE PA
9 15635 HANNASTOWN PA
10 15681 SALTSBURG PA
11 18657 TUNKHANNOCK PA
12 15279 PITTSBURG PA
13 17231 LEMASTERS PA
14 18821 GREAT BEND PA

Merging

  • pd.merge() computes a merge on ALL columns that occur in both DataFrames
    • in the following case, the common column is Zipcode
    • for any row in which the Zipcode entry in cities matches a row in population, a new row is made in the merfed DataFrame.
    • by default, this is an inner join
      • it's an inner join because it glues together only rows that match in the joining columns of BOTH DataFrames
In [106]:
pd.merge(population, cities)
Out[106]:
Zipcode 2010 Census Population City State
0 16855 282 MINERAL SPRINGS PA
1 15681 5241 SALTSBURG PA
2 18657 11985 TUNKHANNOCK PA
3 17307 5899 BIGLERVILLE PA
4 15635 220 HANNASTOWN PA

Medal DataFrames

In [107]:
bronze = pd.read_csv(so_bronze_file)
bronze.head()
Out[107]:
NOC Country Total
0 USA United States 1052.0
1 URS Soviet Union 584.0
2 GBR United Kingdom 505.0
3 FRA France 475.0
4 GER Germany 454.0
In [108]:
len(bronze)
Out[108]:
138
In [109]:
gold = pd.read_csv(so_gold_file)
gold.head()
Out[109]:
NOC Country Total
0 USA United States 2088.0
1 URS Soviet Union 838.0
2 GBR United Kingdom 498.0
3 FRA France 378.0
4 GER Germany 407.0
In [110]:
len(gold)
Out[110]:
138

Merging all columns

  • by default, pd.merge() uses all columns common to both DataFrames to merge
  • the rows of the merged DataFrame consist of all rows where the NOC, Country, and Totals columns are identical in both DataFrames
In [111]:
so_merge = pd.merge(bronze, gold)
so_merge.head()
Out[111]:
NOC Country Total
0 ESP Spain 92.0
1 IRL Ireland 8.0
2 SYR Syria 1.0
3 MOZ Mozambique 1.0
4 SUR Suriname 1.0
In [112]:
len(so_merge)
Out[112]:
18
In [113]:
so_merge.columns
Out[113]:
Index(['NOC', 'Country', 'Total'], dtype='object')
In [114]:
so_merge.index
Out[114]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], dtype='int64')

Merging on

In [115]:
so_merge = pd.merge(bronze, gold, on='NOC')
so_merge.head()
Out[115]:
NOC Country_x Total_x Country_y Total_y
0 USA United States 1052.0 United States 2088.0
1 URS Soviet Union 584.0 Soviet Union 838.0
2 GBR United Kingdom 505.0 United Kingdom 498.0
3 FRA France 475.0 France 378.0
4 GER Germany 454.0 Germany 407.0
In [116]:
len(so_merge)
Out[116]:
138

Merging on multiple columns

  • this is where merging extend concatenation in allowing matching on multiple columns
In [117]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'])
so_merge.head()
Out[117]:
NOC Country Total_x Total_y
0 USA United States 1052.0 2088.0
1 URS Soviet Union 584.0 838.0
2 GBR United Kingdom 505.0 498.0
3 FRA France 475.0 378.0
4 GER Germany 454.0 407.0

Using suffixes

In [118]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
so_merge.head()
Out[118]:
NOC Country Total_bronze Total_gold
0 USA United States 1052.0 2088.0
1 URS Soviet Union 584.0 838.0
2 GBR United Kingdom 505.0 498.0
3 FRA France 475.0 378.0
4 GER Germany 454.0 407.0

Counties DataFrame

In [119]:
pa_counties = {'CITY NAME': ['SALTSBURG', 'MINERAL SPRINGS', 'BIGLERVILLE', 'HANNASTOWN', 'TUNKHANNOCK'],
               'COUNTY NAME': ['INDIANA', 'CLEARFIELD', 'ADAMS', 'WESTMORELAND', 'WYOMING']}
counties = pd.DataFrame.from_dict(pa_counties)
counties
Out[119]:
CITY NAME COUNTY NAME
0 SALTSBURG INDIANA
1 MINERAL SPRINGS CLEARFIELD
2 BIGLERVILLE ADAMS
3 HANNASTOWN WESTMORELAND
4 TUNKHANNOCK WYOMING
In [120]:
cities.tail()
Out[120]:
Zipcode City State
10 15681 SALTSBURG PA
11 18657 TUNKHANNOCK PA
12 15279 PITTSBURG PA
13 17231 LEMASTERS PA
14 18821 GREAT BEND PA

Specifying columns to merge

In [121]:
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
Out[121]:
CITY NAME COUNTY NAME Zipcode City State
0 SALTSBURG INDIANA 15681 SALTSBURG PA
1 MINERAL SPRINGS CLEARFIELD 16855 MINERAL SPRINGS PA
2 BIGLERVILLE ADAMS 17307 BIGLERVILLE PA
3 HANNASTOWN WESTMORELAND 15635 HANNASTOWN PA
4 TUNKHANNOCK WYOMING 18657 TUNKHANNOCK PA

Switching left/right DataFrames

In [122]:
pd.merge(cities, counties, left_on='City', right_on='CITY NAME')
Out[122]:
Zipcode City State CITY NAME COUNTY NAME
0 17307 BIGLERVILLE PA BIGLERVILLE ADAMS
1 16855 MINERAL SPRINGS PA MINERAL SPRINGS CLEARFIELD
2 15635 HANNASTOWN PA HANNASTOWN WESTMORELAND
3 15681 SALTSBURG PA SALTSBURG INDIANA
4 18657 TUNKHANNOCK PA TUNKHANNOCK WYOMING
In [123]:
del pa_zipcode_population, pa_zipcode_city, population, cities, bronze, gold, so_merge, pa_counties, counties

Exercises

Merging company DataFrames

Suppose your company has operations in several different cities under several different managers. The DataFrames revenue and managers contain partial information related to the company. That is, the rows of the city columns don't quite match in revenue and managers (the Mendocino branch has no revenue yet since it just opened and the manager of Springfield branch recently left the company).

The DataFrames have been printed in the IPython Shell. If you were to run the command combined = pd.merge(revenue, managers, on='city'), how many rows would combined have?

In [124]:
rev = {'city': ['Austin', 'Denver', 'Springfield'], 'revenue': [100, 83, 4]}
man = {'city': ['Austin', 'Denver', 'Mendocino'], 'manager': ['Charles', 'Joel', 'Brett']}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
In [125]:
combined = pd.merge(revenue, managers, on='city')
combined
Out[125]:
city revenue manager
0 Austin 100 Charles
1 Denver 83 Joel

Merging on a specific column

This exercise follows on the last one with the DataFrames revenue and managers for your company. You expect your company to grow and, eventually, to operate in cities with the same name on different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a branch_id column to both DataFrames. Moreover, new cities have been added to both the revenue and managers DataFrames as well. pandas has been imported as pd and both DataFrames are available in your namespace.

At present, there should be a 1-to-1 relationship between the city and branch_id fields. In that case, the result of a merge on the city columns ought to give you the same output as a merge on the branch_id columns. Do they? Can you spot an ambiguity in one of the DataFrames?

Instructions

  • Using pd.merge(), merge the DataFrames revenue and managers on the 'city' column of each. Store the result as merge_by_city.
  • Print the DataFrame merge_by_city. This has been done for you.
  • Merge the DataFrames revenue and managers on the 'branch_id' column of each. Store the result as merge_by_id.
  • Print the DataFrame merge_by_id. This has been done for you, so hit 'Submit Answer' to see the result!
In [126]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'], 'revenue': [100, 83, 4, 200], 'branch_id': [10, 20, 30, 47]}
man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'], 'manager': ['Charles', 'Joel', 'Brett', 'Sally'], 'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
In [127]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
merge_by_city
Out[127]:
city revenue branch_id_x manager branch_id_y
0 Austin 100 10 Charles 10
1 Denver 83 20 Joel 20
2 Springfield 4 30 Sally 31
3 Mendocino 200 47 Brett 47
In [128]:
# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
merge_by_id
Out[128]:
city_x revenue branch_id city_y manager
0 Austin 100 10 Austin Charles
1 Denver 83 20 Denver Joel
2 Mendocino 200 47 Mendocino Brett

Notice that when you merge on 'city', the resulting DataFrame has a peculiar result: In row 2, the city Springfield has two different branch IDs. This is because there are actually two different cities named Springfield - one in the State of Illinois, and the other in Missouri. The revenue DataFrame has the one from Illinois, and the managers DataFrame has the one from Missouri. Consequently, when you merge on 'branch_id', both of these get dropped from the merged DataFrame.

Merging on columns with non-matching labels

You continue working with the revenue & managers DataFrames from before. This time, someone has changed the field name 'city' to 'branch' in the managers table. Now, when you attempt to merge DataFrames, an exception is thrown:

>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'

Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the left_on and right_on parameters in the call to pd.merge().

As before, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.

Are you able to merge better than in the last exercise? How should the rows with Springfield be handled?

Instructions

  • Merge the DataFrames revenue and managers into a single DataFrame called combined using the 'city' and 'branch' columns from the appropriate DataFrames.
    • In your call to pd.merge(), you will have to specify the parameters left_on and right_on appropriately.
  • Print the new DataFrame combined.
In [129]:
state_rev = {'Austin': 'TX', 'Denver': 'CO', 'Springfield': 'IL', 'Mendocino': 'CA'}
state_man = {'Austin': 'TX', 'Denver': 'CO', 'Mendocino': 'CA', 'Springfield': 'MO'}
In [130]:
revenue['state'] = revenue['city'].map(state_rev)
managers['state'] = managers['city'].map(state_man)
In [131]:
managers.rename(columns={'city': 'branch'}, inplace=True)
In [132]:
revenue
Out[132]:
city revenue branch_id state
0 Austin 100 10 TX
1 Denver 83 20 CO
2 Springfield 4 30 IL
3 Mendocino 200 47 CA
In [133]:
managers
Out[133]:
branch manager branch_id state
0 Austin Charles 10 TX
1 Denver Joel 20 CO
2 Mendocino Brett 47 CA
3 Springfield Sally 31 MO
In [134]:
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')
combined
Out[134]:
city revenue branch_id_x state_x branch manager branch_id_y state_y
0 Austin 100 10 TX Austin Charles 10 TX
1 Denver 83 20 CO Denver Joel 20 CO
2 Springfield 4 30 IL Springfield Sally 31 MO
3 Mendocino 200 47 CA Mendocino Brett 47 CA

Merging on multiple columns

Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called state to both DataFrames from the preceding exercises. Again, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace.

Your goal in this exercise is to use pd.merge() to merge DataFrames using multiple columns (using 'branch_id', 'city', and 'state' in this case).

Are you able to match all your company's branches correctly?

Instructions

  • Create a column called 'state' in the DataFrame revenue, consisting of the list ['TX','CO','IL','CA'].
  • Create a column called 'state' in the DataFrame managers, consisting of the list ['TX','CO','CA','MO'].
  • Merge the DataFrames revenue and managers using three columns :'branch_id', 'city', and 'state'. Pass them in as a list to the on paramater of pd.merge().
In [135]:
managers.rename(columns={'branch': 'city'}, inplace=True)
In [136]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
print(combined)
        city  revenue  branch_id state  manager
0     Austin      100         10    TX  Charles
1     Denver       83         20    CO     Joel
2  Mendocino      200         47    CA    Brett
In [137]:
del rev, man, revenue, managers, merge_by_city, merge_by_id, combined

Joining DataFrames

  • Pandas has to search through DataFrame rows for matches when computing joins and merges
    • It's useful to have different kinds of joins to mitigate costs

Medal DataFrames

In [138]:
bronze = pd.read_csv(so_bronze_file)
bronze.head()
Out[138]:
NOC Country Total
0 USA United States 1052.0
1 URS Soviet Union 584.0
2 GBR United Kingdom 505.0
3 FRA France 475.0
4 GER Germany 454.0
In [139]:
len(bronze)
Out[139]:
138
In [140]:
gold = pd.read_csv(so_gold_file)
gold.head()
Out[140]:
NOC Country Total
0 USA United States 2088.0
1 URS Soviet Union 838.0
2 GBR United Kingdom 498.0
3 FRA France 378.0
4 GER Germany 407.0
In [141]:
len(gold)
Out[141]:
138

Merging with inner join

  • merge() does an inner join by default
    • it extracts the rows that match in joining columns from both DataFrames and it glues them together in the joined DataFrame
    • the property how=innner is the default behavior
In [142]:
so_merge = pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
so_merge.head()
Out[142]:
NOC Country Total_bronze Total_gold
0 USA United States 1052.0 2088.0
1 URS Soviet Union 584.0 838.0
2 GBR United Kingdom 505.0 498.0
3 FRA France 475.0 378.0
4 GER Germany 454.0 407.0

Merging with left join

  • using how=left keeps all rows of the left DataFrame in the merged DataFrame
  • Keeps all rows of the left DF in the merged DF
  • For rows in the left DF with matches in the right DF:
    • Non-joining columns of right DF are appended to left DF
  • For rows in the left DF with no matches in the right DF:
    • Non-joining columns are filled with nulls
In [143]:
bronze = pd.read_csv(so_bronze5_file)
gold = pd.read_csv(so_gold5_file)
In [144]:
g_noc = ['USA', 'URS', 'GBR', 'ITA', 'GER']
b_noc = ['USA', 'URS', 'GBR', 'FRA', 'GER']
In [145]:
gold['NOC'] = g_noc
bronze['NOC'] = b_noc
In [146]:
gold
Out[146]:
Country Total NOC
0 United States 2088.0 USA
1 Soviet Union 838.0 URS
2 United Kingdom 498.0 GBR
3 Italy 460.0 ITA
4 Germany 407.0 GER
In [147]:
bronze
Out[147]:
Country Total NOC
0 United States 1052.0 USA
1 Soviet Union 584.0 URS
2 United Kingdom 505.0 GBR
3 France 475.0 FRA
4 Germany 454.0 GER
In [148]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
Out[148]:
Country Total_bronze NOC Total_gold
0 United States 1052.0 USA 2088.0
1 Soviet Union 584.0 URS 838.0
2 United Kingdom 505.0 GBR 498.0
3 France 475.0 FRA NaN
4 Germany 454.0 GER 407.0

Merging with right join

In [149]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='right')
Out[149]:
Country Total_bronze NOC Total_gold
0 United States 1052.0 USA 2088.0
1 Soviet Union 584.0 URS 838.0
2 United Kingdom 505.0 GBR 498.0
3 Germany 454.0 GER 407.0
4 Italy NaN ITA 460.0

Merging with outer join

In [150]:
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')
Out[150]:
Country Total_bronze NOC Total_gold
0 United States 1052.0 USA 2088.0
1 Soviet Union 584.0 URS 838.0
2 United Kingdom 505.0 GBR 498.0
3 France 475.0 FRA NaN
4 Germany 454.0 GER 407.0
5 Italy NaN ITA 460.0

Population & unemployment data

In [151]:
population = pd.DataFrame.from_dict({'Zip Code ZCTA': [57538, 59916, 37660, 2860],
                                     '2010 Census Population': [322, 130, 40038, 45199]})
population.set_index('Zip Code ZCTA', inplace=True)
population
Out[151]:
2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199
In [152]:
unemployment = pd.DataFrame.from_dict({'Zip': [2860, 46167, 1097],
                                       'unemployment': [0.11, 0.02, 0.33],
                                       'participants': [ 34447, 4800, 32]})
unemployment.set_index('Zip', inplace=True)
unemployment
Out[152]:
unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 32

Using .join(how='left')

  • computes a left join using the Index by default
In [153]:
population.join(unemployment)
Out[153]:
2010 Census Population unemployment participants
Zip Code ZCTA
57538 322 NaN NaN
59916 130 NaN NaN
37660 40038 NaN NaN
2860 45199 0.11 34447.0

Using .join(how='right')

In [154]:
population.join(unemployment, how='right')
Out[154]:
2010 Census Population unemployment participants
Zip
2860 45199.0 0.11 34447
46167 NaN 0.02 4800
1097 NaN 0.33 32

Using .join(how='inner')

In [155]:
population.join(unemployment, how='inner')
Out[155]:
2010 Census Population unemployment participants
2860 45199 0.11 34447

Using .join(how='outer')

In [156]:
population.join(unemployment, how='outer')
Out[156]:
2010 Census Population unemployment participants
1097 NaN 0.33 32.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
In [157]:
del bronze, gold, so_merge, g_noc, b_noc, population, unemployment

Which should you use?

  • df1.append(df2): stacking vertically
  • pd.concat([df1, df2]):
    • stacking many horizontally or vertically
    • simple inner/outer joins on Indexes
  • df1.join(df2): inner/outer/left/right joins on Indexes
  • pd.merge([df1, df2]): many joins on multiple columns

Exercises

Data

In [158]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200],
       'branch_id': [10, 20, 30, 47]}

man = {'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
       'branch_id': [10, 20, 47, 31]}

revenue = pd.DataFrame.from_dict(rev)
revenue.set_index('branch_id', inplace=True)
managers = pd.DataFrame.from_dict(man)
managers.set_index('branch_id', inplace=True)
In [159]:
revenue
Out[159]:
city state revenue
branch_id
10 Austin TX 100
20 Denver CO 83
30 Springfield IL 4
47 Mendocino CA 200
In [160]:
managers
Out[160]:
city state manager
branch_id
10 Austin TX Charles
20 Denver CO Joel
47 Mendocino CA Brett
31 Springfield MO Sally

Joining by Index

The DataFrames revenue and managers are displayed in the IPython Shell. Here, they are indexed by 'branch_id'.

Choose the function call below that will join the DataFrames on their indexes and return 5 rows with index labels [10, 20, 30, 31, 47]. Explore each of them in the IPython Shell to get a better understanding of their functionality.

In [161]:
revenue.join(managers, lsuffix='_rev', rsuffix='_mng', how='outer')
Out[161]:
city_rev state_rev revenue city_mng state_mng manager
branch_id
10 Austin TX 100.0 Austin TX Charles
20 Denver CO 83.0 Denver CO Joel
30 Springfield IL 4.0 NaN NaN NaN
31 NaN NaN NaN Springfield MO Sally
47 Mendocino CA 200.0 Mendocino CA Brett

Choosing a joining strategy

Suppose you have two DataFrames: students (with columns 'StudentID', 'LastName', 'FirstName', and 'Major') and midterm_results (with columns 'StudentID', 'Q1', 'Q2', and 'Q3' for their scores on midterm questions).

You want to combine the DataFrames into a single DataFrame grades, and be able to easily spot which students wrote the midterm and which didn't (their midterm question scores 'Q1', 'Q2', & 'Q3' should be filled with NaN values).

You also want to drop rows from midterm_results in which the StudentID is not found in students.

Which of the following strategies gives the desired result?

In [162]:
students = pd.DataFrame.from_dict({'StudentID': [], 'LastName': [], 'FirstName': [], 'Major': []})
midterm_results = pd.DataFrame.from_dict({'StudentID': [], 'Q1': [], 'Q2': [], 'Q3': []})
In [163]:
students
Out[163]:
StudentID LastName FirstName Major
In [164]:
midterm_results
Out[164]:
StudentID Q1 Q2 Q3
In [165]:
grades = pd.merge(students, midterm_results, how='left')

Left & right merging on multiple columns

You now have, in addition to the revenue and managers DataFrames from prior exercises, a DataFrame sales that summarizes units sold from specific branches (identified by city and state but not branch_id).

Once again, the managers DataFrame uses the label branch in place of city as in the other two DataFrames. Your task here is to employ left and right merges to preserve data and identify where data is missing.

By merging revenue and sales with a right merge, you can identify the missing revenue values. Here, you don't need to specify left_on or right_on because the columns to merge on have matching labels.

By merging sales and managers with a left merge, you can identify the missing manager. Here, the columns to merge on have conflicting labels, so you must specify left_on and right_on. In both cases, you're looking to figure out how to connect the fields in rows containing Springfield.

pandas has been imported as pd and the three DataFrames revenue, managers, and sales have been pre-loaded. They have been printed for you to explore in the IPython Shell.

Instructions

  • Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame revenue_and_sales.
    • Use how='right' and on=['city', 'state'].
  • Print the new DataFrame revenue_and_sales. This has been done for you.
  • Execute a left merge with sales and managers to yield a new DataFrame sales_and_managers.
    • Use how='left', left_on=['city', 'state'], and right_on=['branch', 'state'].
  • Print the new DataFrame sales_and_managers. This has been done for you, so hit 'Submit Answer' to see the result!
In [166]:
rev = {'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
       'branch_id': [10, 20, 30, 47],
       'state': ['TX','CO','IL','CA'],
       'revenue': [100, 83, 4, 200]}

man = {'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
       'branch_id': [10, 20, 47, 31],
       'state': ['TX','CO','CA','MO'],
       'manager': ['Charles', 'Joel', 'Brett', 'Sally']}

sale = {'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
        'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
        'units': [1, 4, 2, 5, 1]}

revenue = pd.DataFrame.from_dict(rev)
managers = pd.DataFrame.from_dict(man)
sales = pd.DataFrame.from_dict(sale)
In [167]:
revenue
Out[167]:
city branch_id state revenue
0 Austin 10 TX 100
1 Denver 20 CO 83
2 Springfield 30 IL 4
3 Mendocino 47 CA 200
In [168]:
managers
Out[168]:
branch branch_id state manager
0 Austin 10 TX Charles
1 Denver 20 CO Joel
2 Mendocino 47 CA Brett
3 Springfield 31 MO Sally
In [169]:
sales
Out[169]:
city state units
0 Mendocino CA 1
1 Denver CO 4
2 Austin TX 2
3 Springfield MO 5
4 Springfield IL 1
In [170]:
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
revenue_and_sales
Out[170]:
city branch_id state revenue units
0 Austin 10.0 TX 100.0 2
1 Denver 20.0 CO 83.0 4
2 Springfield 30.0 IL 4.0 1
3 Mendocino 47.0 CA 200.0 1
4 Springfield NaN MO NaN 5
In [171]:
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
sales_and_managers
Out[171]:
city state units branch branch_id manager
0 Mendocino CA 1 Mendocino 47.0 Brett
1 Denver CO 4 Denver 20.0 Joel
2 Austin TX 2 Austin 10.0 Charles
3 Springfield MO 5 Springfield 31.0 Sally
4 Springfield IL 1 NaN NaN NaN

Merging DataFrames with outer join

This exercise picks up where the previous one left off. The DataFrames revenue, managers, and sales are pre-loaded into your namespace (and, of course, pandas is imported as pd). Moreover, the merged DataFrames revenue_and_sales and sales_and_managers have been pre-computed exactly as you did in the previous exercise.

The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with restricted subset of columns as keys.

Instructions

  • Merge sales_and_managers with revenue_and_sales. Store the result as merge_default.
  • Print merge_default. This has been done for you.
  • Merge sales_and_managers with revenue_and_sales using how='outer'. Store the result as merge_outer.
  • Print merge_outer. This has been done for you.
  • Merge sales_and_managers with revenue_and_sales only on ['city','state'] using an outer join. Store the result as merge_outer_on and hit 'Submit Answer' to see what the merged DataFrames look like!
In [172]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
merge_default
Out[172]:
city state units branch branch_id manager revenue
0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
1 Denver CO 4 Denver 20.0 Joel 83.0
2 Austin TX 2 Austin 10.0 Charles 100.0
In [173]:
# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Print merge_outer
merge_outer
Out[173]:
city state units branch branch_id manager revenue
0 Mendocino CA 1 Mendocino 47.0 Brett 200.0
1 Denver CO 4 Denver 20.0 Joel 83.0
2 Austin TX 2 Austin 10.0 Charles 100.0
3 Springfield MO 5 Springfield 31.0 Sally NaN
4 Springfield IL 1 NaN NaN NaN NaN
5 Springfield IL 1 NaN 30.0 NaN 4.0
6 Springfield MO 5 NaN NaN NaN NaN
In [174]:
# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city', 'state'], how='outer')

# Print merge_outer_on
merge_outer_on
Out[174]:
city state units_x branch branch_id_x manager branch_id_y revenue units_y
0 Mendocino CA 1 Mendocino 47.0 Brett 47.0 200.0 1
1 Denver CO 4 Denver 20.0 Joel 20.0 83.0 4
2 Austin TX 2 Austin 10.0 Charles 10.0 100.0 2
3 Springfield MO 5 Springfield 31.0 Sally NaN NaN 5
4 Springfield IL 1 NaN NaN NaN 30.0 4.0 1
In [175]:
del rev, man, revenue, managers, students, midterm_results, grades, sale, sales, revenue_and_sales, sales_and_managers, merge_default, merge_outer, merge_outer_on

Ordered merges

Software & hardware sales

In [176]:
software = pd.read_csv(sales_feb_software_file, parse_dates=['Date']).sort_values('Date')
software.head(10)
Out[176]:
Date Company Product Units
2 2015-02-02 08:33:01 Hooli Software 3
1 2015-02-03 14:14:18 Initech Software 13
7 2015-02-04 15:36:29 Streeplex Software 13
3 2015-02-05 01:53:06 Acme Coporation Software 19
5 2015-02-09 13:09:55 Mediacore Software 7
4 2015-02-11 20:03:08 Initech Software 7
6 2015-02-11 22:50:44 Hooli Software 4
0 2015-02-16 12:09:19 Hooli Software 10
8 2015-02-21 05:01:26 Mediacore Software 3
In [177]:
hardware = pd.read_csv(sales_feb_hardware_file, parse_dates=['Date']).sort_values('Date')
hardware.head()
Out[177]:
Date Company Product Units
3 2015-02-02 20:54:49 Mediacore Hardware 9
0 2015-02-04 21:52:45 Acme Coporation Hardware 14
1 2015-02-07 22:58:10 Acme Coporation Hardware 1
2 2015-02-19 10:59:33 Mediacore Hardware 16
4 2015-02-21 20:41:47 Hooli Hardware 3

Using merge()

  • attempting to merge yields an empty DataFrame because it's doing an INNER join on all columns with matching names by defaults
    • 'Units' and 'Date' columns have no overlapping values, so the result is empty
In [178]:
sales_merge = pd.merge(hardware, software)
sales_merge
Out[178]:
Date Company Product Units
In [179]:
sales_merge.info()
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
Date       0 non-null datetime64[ns]
Company    0 non-null object
Product    0 non-null object
Units      0 non-null int64
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 0.0+ bytes

Using merge(how='outer')

In [180]:
sales_merge = pd.merge(hardware, software, how='outer')
sales_merge.head(14)
Out[180]:
Date Company Product Units
0 2015-02-02 20:54:49 Mediacore Hardware 9
1 2015-02-04 21:52:45 Acme Coporation Hardware 14
2 2015-02-07 22:58:10 Acme Coporation Hardware 1
3 2015-02-19 10:59:33 Mediacore Hardware 16
4 2015-02-21 20:41:47 Hooli Hardware 3
5 2015-02-02 08:33:01 Hooli Software 3
6 2015-02-03 14:14:18 Initech Software 13
7 2015-02-04 15:36:29 Streeplex Software 13
8 2015-02-05 01:53:06 Acme Coporation Software 19
9 2015-02-09 13:09:55 Mediacore Software 7
10 2015-02-11 20:03:08 Initech Software 7
11 2015-02-11 22:50:44 Hooli Software 4
12 2015-02-16 12:09:19 Hooli Software 10
13 2015-02-21 05:01:26 Mediacore Software 3

Sorting merge(how='outer')

In [181]:
sales_merge = pd.merge(hardware, software, how='outer').sort_values('Date')
sales_merge.head(14)
Out[181]:
Date Company Product Units
5 2015-02-02 08:33:01 Hooli Software 3
0 2015-02-02 20:54:49 Mediacore Hardware 9
6 2015-02-03 14:14:18 Initech Software 13
7 2015-02-04 15:36:29 Streeplex Software 13
1 2015-02-04 21:52:45 Acme Coporation Hardware 14
8 2015-02-05 01:53:06 Acme Coporation Software 19
2 2015-02-07 22:58:10 Acme Coporation Hardware 1
9 2015-02-09 13:09:55 Mediacore Software 7
10 2015-02-11 20:03:08 Initech Software 7
11 2015-02-11 22:50:44 Hooli Software 4
12 2015-02-16 12:09:19 Hooli Software 10
3 2015-02-19 10:59:33 Mediacore Hardware 16
13 2015-02-21 05:01:26 Mediacore Software 3
4 2015-02-21 20:41:47 Hooli Hardware 3

Using merge_ordered()

  • the default is an OUTER join
In [182]:
sales_merged = pd.merge_ordered(hardware, software)
sales_merged.head(14)
Out[182]:
Date Company Product Units
0 2015-02-02 08:33:01 Hooli Software 3
1 2015-02-02 20:54:49 Mediacore Hardware 9
2 2015-02-03 14:14:18 Initech Software 13
3 2015-02-04 15:36:29 Streeplex Software 13
4 2015-02-04 21:52:45 Acme Coporation Hardware 14
5 2015-02-05 01:53:06 Acme Coporation Software 19
6 2015-02-07 22:58:10 Acme Coporation Hardware 1
7 2015-02-09 13:09:55 Mediacore Software 7
8 2015-02-11 20:03:08 Initech Software 7
9 2015-02-11 22:50:44 Hooli Software 4
10 2015-02-16 12:09:19 Hooli Software 10
11 2015-02-19 10:59:33 Mediacore Hardware 16
12 2015-02-21 05:01:26 Mediacore Software 3
13 2015-02-21 20:41:47 Hooli Hardware 3

Using on & suffixes

In [183]:
sales_merged = pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
sales_merged.head()
Out[183]:
Date Company Product_hardware Units_hardware Product_software Units_software
0 2015-02-02 08:33:01 Hooli NaN NaN Software 3.0
1 2015-02-02 20:54:49 Mediacore Hardware 9.0 NaN NaN
2 2015-02-03 14:14:18 Initech NaN NaN Software 13.0
3 2015-02-04 15:36:29 Streeplex NaN NaN Software 13.0
4 2015-02-04 21:52:45 Acme Coporation Hardware 14.0 NaN NaN

Stocks data

In [184]:
stocks_dir = Path.cwd() / 'data' / 'stocks'
sp500_stocks = stocks_dir / 'SP500.csv'
aapl = stocks_dir/ 'AAPL.csv'
csco = stocks_dir/ 'CSCO.csv'
amzn = stocks_dir/ 'AMZN.csv'
msft = stocks_dir/ 'MSFT.csv'
ibm = stocks_dir/ 'IBM.csv'
In [185]:
sp500_df = pd.read_csv(sp500_stocks, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
aapl_df = pd.read_csv(aapl, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
csco_df = pd.read_csv(csco, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
amzn_df = pd.read_csv(amzn, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
msft_df = pd.read_csv(msft, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
ibm_df = pd.read_csv(ibm, usecols=['Date', 'Close'], parse_dates=['Date'], index_col=['Date'])
In [186]:
sp500_df.rename(columns={'Close': 'S&P'}, inplace=True)
aapl_df.rename(columns={'Close': 'AAPL'}, inplace=True)
csco_df.rename(columns={'Close': 'CSCO'}, inplace=True)
amzn_df.rename(columns={'Close': 'AMZN'}, inplace=True)
msft_df.rename(columns={'Close': 'MSFT'}, inplace=True)
ibm_df.rename(columns={'Close': 'IBM'}, inplace=True)
In [187]:
stocks = pd.concat([sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df], axis=1)
In [188]:
stocks.head()
Out[188]:
S&P AAPL CSCO AMZN MSFT IBM
Date
2000-01-03 1455.219971 3.997768 54.03125 89.3750 58.28125 116.0000
2000-01-04 1399.420044 3.660714 51.00000 81.9375 56.31250 112.0625
2000-01-05 1402.109985 3.714286 50.84375 69.7500 56.90625 116.0000
2000-01-06 1403.449951 3.392857 50.00000 65.5625 55.00000 114.0000
2000-01-07 1441.469971 3.553571 52.93750 69.5625 55.71875 113.5000
In [189]:
stocks.tail()
Out[189]:
S&P AAPL CSCO AMZN MSFT IBM
Date
2019-03-18 2832.939941 188.020004 53.509998 1742.150024 117.570000 140.210007
2019-03-19 2832.570068 186.529999 53.310001 1761.849976 117.650002 140.490005
2019-03-20 2824.229980 188.160004 53.259998 1797.270020 117.519997 139.600006
2019-03-21 2854.879883 195.089996 53.939999 1819.260010 120.220001 141.440002
2019-03-22 2800.709961 191.050003 52.740002 1764.770020 117.050003 139.449997
In [190]:
stocks.to_csv(stocks_dir / 'stocks.csv', index=True, index_label='Date')

GDP data

In [191]:
gdp = pd.read_csv(gdp_usa_file, parse_dates=['DATE'])
gdp.sort_values(by=['DATE'], ascending=False, inplace=True)
gdp.reset_index(inplace=True, drop=True)
gdp.rename(columns={'VALUE': 'GDP', 'DATE': 'Date'}, inplace=True)
gdp.head(8)
Out[191]:
Date GDP
0 2016-04-01 18436.5
1 2016-01-01 18281.6
2 2015-10-01 18222.8
3 2015-07-01 18141.9
4 2015-04-01 17998.3
5 2015-01-01 17783.6
6 2014-10-01 17692.2
7 2014-07-01 17569.4

Ordered merge

In [192]:
gdp_2000_2015 = gdp[(gdp['Date'].dt.year >= 2000) & (gdp['Date'].dt.year <= 2015)]
In [193]:
stocks.reset_index(inplace=True)
stocks.head(5)
Out[193]:
Date S&P AAPL CSCO AMZN MSFT IBM
0 2000-01-03 1455.219971 3.997768 54.03125 89.3750 58.28125 116.0000
1 2000-01-04 1399.420044 3.660714 51.00000 81.9375 56.31250 112.0625
2 2000-01-05 1402.109985 3.714286 50.84375 69.7500 56.90625 116.0000
3 2000-01-06 1403.449951 3.392857 50.00000 65.5625 55.00000 114.0000
4 2000-01-07 1441.469971 3.553571 52.93750 69.5625 55.71875 113.5000
In [194]:
stocks_2000_2015 = stocks[(stocks['Date'].dt.year >= 2000) & (stocks['Date'].dt.year <= 2015)]
In [195]:
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date')
ordered_df.head()
Out[195]:
Date S&P AAPL CSCO AMZN MSFT IBM GDP
0 2000-01-01 NaN NaN NaN NaN NaN NaN 10031.0
1 2000-01-03 1455.219971 3.997768 54.03125 89.3750 58.28125 116.0000 NaN
2 2000-01-04 1399.420044 3.660714 51.00000 81.9375 56.31250 112.0625 NaN
3 2000-01-05 1402.109985 3.714286 50.84375 69.7500 56.90625 116.0000 NaN
4 2000-01-06 1403.449951 3.392857 50.00000 65.5625 55.00000 114.0000 NaN

Ordered merge with ffill

In [196]:
ordered_df = pd.merge_ordered(stocks_2000_2015, gdp_2000_2015, on='Date', fill_method='ffill')
ordered_df.head()
Out[196]:
Date S&P AAPL CSCO AMZN MSFT IBM GDP
0 2000-01-01 NaN NaN NaN NaN NaN NaN 10031.0
1 2000-01-03 1455.219971 3.997768 54.03125 89.3750 58.28125 116.0000 10031.0
2 2000-01-04 1399.420044 3.660714 51.00000 81.9375 56.31250 112.0625 10031.0
3 2000-01-05 1402.109985 3.714286 50.84375 69.7500 56.90625 116.0000 10031.0
4 2000-01-06 1403.449951 3.392857 50.00000 65.5625 55.00000 114.0000 10031.0
In [197]:
del software, hardware, sales_merge, sales_merged, stocks_dir, sp500_stocks, aapl
del csco, amzn, msft, ibm, sp500_df, aapl_df, csco_df, amzn_df, msft_df, ibm_df, stocks
del gdp, gdp_2000_2015, stocks_2000_2015, ordered_df

Exercises

Using merge_ordered()

This exercise uses pre-loaded DataFrames austin and houston that contain weather data from the cities Austin and Houston respectively. They have been printed in the IPython Shell for you to examine.

Weather conditions were recorded on separate days and you need to merge these two DataFrames together such that the dates are ordered. To do this, you'll use pd.merge_ordered(). After you're done, note the order of the rows before and after merging.

Instructions

  • Perform an ordered merge on austin and houston using pd.merge_ordered(). Store the result as tx_weather.
  • Print tx_weather. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city.
  • Perform another ordered merge on austin and houston.
    • This time, specify the keyword arguments on='date' and suffixes=['_aus','_hus'] so that the rows can be distinguished. Store the result as tx_weather_suff.
  • Print tx_weather_suff to examine its contents. This has been done for you.
  • Perform a third ordered merge on austin and houston.
    • This time, in addition to the on and suffixes parameters, specify the keyword argument fill_method='ffill' to use forward-filling to replace NaN entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames!
In [198]:
austin = pd.DataFrame.from_dict({'date': ['2016-01-01', '2016-02-08', '2016-01-17'], 'ratings': ['Cloudy', 'Cloudy', 'Sunny']})
houston = pd.DataFrame.from_dict({'date': ['2016-01-04', '2016-01-01', '2016-03-01'], 'ratings': ['Rainy', 'Cloudy', 'Sunny']})
In [199]:
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
tx_weather
Out[199]:
date ratings
0 2016-01-01 Cloudy
1 2016-01-04 Rainy
2 2016-01-17 Sunny
3 2016-02-08 Cloudy
4 2016-03-01 Sunny
In [200]:
# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'])

# Print tx_weather_suff
tx_weather_suff
Out[200]:
date ratings_aus ratings_hus
0 2016-01-01 Cloudy Cloudy
1 2016-01-04 NaN Rainy
2 2016-01-17 Sunny NaN
3 2016-02-08 Cloudy NaN
4 2016-03-01 NaN Sunny
In [201]:
# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus','_hus'], fill_method='ffill')

# Print tx_weather_ffill
tx_weather_ffill
Out[201]:
date ratings_aus ratings_hus
0 2016-01-01 Cloudy Cloudy
1 2016-01-04 Cloudy Rainy
2 2016-01-17 Sunny Rainy
3 2016-02-08 Cloudy Rainy
4 2016-03-01 Cloudy Sunny
In [ ]:
del austin, houston, tx_weather, tx_weather_suff, tx_weather_ffill

Using merge_asof()

Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order using the on column, but for each row in the left DataFrame, only rows from the right DataFrame whose 'on' column values are less than the left value will be kept.

This function can be used to align disparate datetime frequencies without having to first resample.

Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency dataset. The oil and automobile DataFrames have been pre-loaded as oil and auto. The first 5 rows of each have been printed in the IPython Shell for you to explore.

These datasets will align such that the first price of the year will be broadcast into the rows of the automobiles DataFrame. This is considered correct since by the start of any given year, most automobiles for that year will have already been manufactured.

You'll then inspect the merged DataFrame, resample by year and compute the mean 'Price' and 'mpg'. You should be able to see a trend in these two columns, that you can confirm by computing the Pearson correlation between resampled 'Price' and 'mpg'.

Instructions

  • Merge auto and oil using pd.merge_asof() with left_on='yr' and ight_on='Date'. Store the result as merged.
  • Print the tail of merged. This has been done for you.
  • Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and aggregate the mean. Store the result as yearly.
  • Hit Submit Answer to examine the contents of yearly and yearly.corr(), which shows the Pearson correlation between the resampled 'Price' and 'mpg'.
In [203]:
oil = pd.read_csv(oil_price_file, parse_dates=['Date'])
auto = pd.read_csv(auto_fuel_file, parse_dates=['yr'])
In [204]:
oil.head()
Out[204]:
Date Price
0 1970-01-01 3.35
1 1970-02-01 3.35
2 1970-03-01 3.35
3 1970-04-01 3.35
4 1970-05-01 3.35
In [205]:
auto.head()
Out[205]:
mpg cyl displ hp weight accel yr origin name
0 18.0 8 307.0 130 3504 12.0 1970-01-01 US chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 1970-01-01 US buick skylark 320
2 18.0 8 318.0 150 3436 11.0 1970-01-01 US plymouth satellite
3 16.0 8 304.0 150 3433 12.0 1970-01-01 US amc rebel sst
4 17.0 8 302.0 140 3449 10.5 1970-01-01 US ford torino
In [206]:
# Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')

# Print the tail of merged
merged.tail()
Out[206]:
mpg cyl displ hp weight accel yr origin name Date Price
387 27.0 4 140.0 86 2790 15.6 1982-01-01 US ford mustang gl 1982-01-01 33.85
388 44.0 4 97.0 52 2130 24.6 1982-01-01 Europe vw pickup 1982-01-01 33.85
389 32.0 4 135.0 84 2295 11.6 1982-01-01 US dodge rampage 1982-01-01 33.85
390 28.0 4 120.0 79 2625 18.6 1982-01-01 US ford ranger 1982-01-01 33.85
391 31.0 4 119.0 82 2720 19.4 1982-01-01 US chevy s-10 1982-01-01 33.85
In [207]:
# Resample merged: yearly
yearly = merged.resample('A', on='Date')[['mpg','Price']].mean()

# Print yearly
yearly
Out[207]:
mpg Price
Date
1970-12-31 17.689655 3.35
1971-12-31 21.111111 3.56
1972-12-31 18.714286 3.56
1973-12-31 17.100000 3.56
1974-12-31 22.769231 10.11
1975-12-31 20.266667 11.16
1976-12-31 21.573529 11.16
1977-12-31 23.375000 13.90
1978-12-31 24.061111 14.85
1979-12-31 25.093103 14.85
1980-12-31 33.803704 32.50
1981-12-31 30.185714 38.00
1982-12-31 32.000000 33.85
In [208]:
# print yearly.corr()
yearly.corr()
Out[208]:
mpg Price
mpg 1.000000 0.948677
Price 0.948677 1.000000

Case Study - Summer Olympics

To cement your new skills, you'll apply them by working on an in-depth study involving Olympic medal data. The analysis involves integrating your multi-DataFrame skills from this course and also skills you've gained in previous pandas courses. This is a rich dataset that will allow you to fully leverage your pandas data manipulation skills. Enjoy!

Medals in the Summer Olympics

Summer Olympic medalists 1896 to 2008 - IOC COUNTRY CODES.csv

In [209]:
pd.read_csv(so_ioc_codes_file).head(8)
Out[209]:
Country NOC ISO code
0 Afghanistan AFG AF
1 Albania ALB AL
2 Algeria ALG DZ
3 American Samoa* ASA AS
4 Andorra AND AD
5 Angola ANG AO
6 Antigua and Barbuda ANT AG
7 Argentina ARG AR

Summer Olympic medalists 1896 to 2008 - EDITIONS.tsv

In [210]:
pd.read_csv(so_editions_file, sep='\t').head(8)
Out[210]:
Edition Bronze Gold Silver Grand Total City Country
0 1896 40 64 47 151 Athens Greece
1 1900 142 178 192 512 Paris France
2 1904 123 188 159 470 St. Louis United States
3 1908 211 311 282 804 London United Kingdom
4 1912 284 301 300 885 Stockholm Sweden
5 1920 355 497 446 1298 Antwerp Belgium
6 1924 285 301 298 884 Paris France
7 1928 242 229 239 710 Amsterdam Netherlands

summer_1896.csv, summer_1900.csv, …, summer_2008.csv

In [211]:
pd.read_csv(so_all_medalists_file, sep='\t', header=4).head(8)
Out[211]:
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
4 Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver
5 Athens 1896 Aquatics Swimming CHOROPHAS, Efstathios GRE Men 1200m freestyle M Bronze
6 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 1200m freestyle M Gold
7 Athens 1896 Aquatics Swimming ANDREOU, Joannis GRE Men 1200m freestyle M Silver

Reminder: loading & merging files

  • pd.read_csv() (& its many options)
  • Looping over files, e.g.,
    • [pd.read_csv(f) for f in glob('*.csv')]
  • Concatenating & appending, e.g.,
    • pd.concat([df1, df2], axis=0)
    • df1.append(df2)

Case Study Explorations

Loading Olympic edition DataFrame

In this chapter, you'll be using The Guardian's Olympic medal dataset.

Your first task here is to prepare a DataFrame editions from a tab-separated values (TSV) file.

Initially, editions has 26 rows (one for each Olympic edition, i.e., a year in which the Olympics was held) and 7 columns: 'Edition', 'Bronze', 'Gold', 'Silver', 'Grand Total', 'City', and 'Country'.

For the analysis that follows, you won't need the overall medal counts, so you want to keep only the useful columns from editions: 'Edition', 'Grand Total', City, and Country.

Instructions

  • Read file_path into a DataFrame called editions. The identifier file_path has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - EDITIONS.tsv'. You'll have to use the option sep='\t' because the file uses tabs to delimit fields (pd.read_csv() expects commas by default).
  • Select only the columns 'Edition', 'Grand Total', 'City', and 'Country' from editions.
  • Print the final DataFrame editions in entirety (there are only 26 rows). This has been done for you, so hit 'Submit Answer' to see the result!
In [212]:
editions = pd.read_csv(so_editions_file, sep='\t')
editions = editions[['Edition', 'Grand Total', 'City', 'Country']]
editions.head()
Out[212]:
Edition Grand Total City Country
0 1896 151 Athens Greece
1 1900 512 Paris France
2 1904 470 St. Louis United States
3 1908 804 London United Kingdom
4 1912 885 Stockholm Sweden

Loading IOC codes DataFrames

Your task here is to prepare a DataFrame ioc_codes from a comma-separated values (CSV) file.

Initially, ioc_codes has 200 rows (one for each country) and 3 columns: 'Country', 'NOC', & 'ISO code'.

For the analysis that follows, you want to keep only the useful columns from ioc_codes: 'Country' and 'NOC' (the column 'NOC' contains three-letter codes representing each country).

Instructions

  • Read file_path into a DataFrame called ioc_codes. The identifier file_path has been pre-defined with the filename 'Summer Olympic medallists 1896 to 2008 - IOC COUNTRY CODES.csv'.
  • Select only the columns 'Country' and 'NOC' from ioc_codes.
  • Print the leading 5 and trailing 5 rows of the DataFrame ioc_codes (there are 200 rows in total). This has been done for you, so hit 'Submit Answer' to see the result!
In [213]:
ioc_codes = pd.read_csv(so_ioc_codes_file)
ioc_codes = ioc_codes[['Country', 'NOC']]
ioc_codes.head()
Out[213]:
Country NOC
0 Afghanistan AFG
1 Albania ALB
2 Algeria ALG
3 American Samoa* ASA
4 Andorra AND

Building medals DataFrame

Here, you'll start with the DataFrame editions from the previous exercise.

You have a sequence of files summer_1896.csv, summer_1900.csv, ..., summer_2008.csv, one for each Olympic edition (year).

You will build up a dictionary medals_dict with the Olympic editions (years) as keys and DataFrames as values.

The dictionary is built up inside a loop over the year of each Olympic edition (from the Index of editions).

Once the dictionary of DataFrames is built up, you will combine the DataFrames using pd.concat().

Instructions

  • Within the for loop:
    • Create the file path. This has been done for you.
    • Read file_path into a DataFrame. Assign the result to the year key of medals_dict.
    • Select only the columns 'Athlete', 'NOC', and 'Medal' from medals_dict[year].
    • Create a new column called 'Edition' in the DataFrame medals_dict[year] whose entries are all year.
  • Concatenate the dictionary of DataFrames medals_dict into a DataFame called medals. Specify the keyword argument ignore_index=True to prevent repeated integer indices.
  • Print the first and last 5 rows of medals. This has been done for you, so hit 'Submit Answer' to see the result!
  • Following is the code used to combine all of the editions by year
    • the individual files are not available
    • the combined dataset is provided
for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)

    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)

    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year][['Athlete', 'NOC', 'Medal']]

    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year

# Concatenate medals_dict: medals
medals = pd.concat(medals_dict, ignore_index=True)
In [214]:
medals = pd.read_csv(so_all_medalists_file, sep='\t', header=4)
medals = medals[['Athlete', 'NOC', 'Medal', 'Edition']]
medals.head()
Out[214]:
Athlete NOC Medal Edition
0 HAJOS, Alfred HUN Gold 1896
1 HERSCHMANN, Otto AUT Silver 1896
2 DRIVAS, Dimitrios GRE Bronze 1896
3 MALOKINIS, Ioannis GRE Gold 1896
4 CHASAPIS, Spiridon GRE Silver 1896

Quantifying Performance

Constructing a pivot table

  • Apply DataFrame pivot_table() method
    • index: column to use as index of pivot table
    • values: column(s) to aggregate
    • aggfunc: function to apply for aggregation
    • columns: categories as columns of pivot table

Case Study Explorations

Counting medals by country/edition in a pivot table

Here, you'll start with the concatenated DataFrame medals from the previous exercise.

You can construct a pivot table to see the number of medals each country won in each year. The result is a new DataFrame with the Olympic edition on the Index and with 138 country NOC codes as columns. If you want a refresher on pivot tables, it may be useful to refer back to the relevant exercises in Manipulating DataFrames with pandas.

Instructions

  • Construct a pivot table from the DataFrame medals, aggregating by count (by specifying the aggfunc parameter). Use 'Edition' as the index, 'Athlete' for the values, and 'NOC' for the columns.
  • Print the first & last 5 rows of medal_counts. This has been done for you, so hit 'Submit Answer' to see the results!
In [215]:
# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')

# Print the first & last 5 rows of medal_counts
medal_counts.head()
Out[215]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1896 NaN NaN NaN NaN NaN NaN 2.0 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 11.0 NaN 7.0 NaN NaN 33.0 NaN 52.0 NaN NaN NaN 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20.0 NaN NaN NaN NaN NaN NaN 6.0
1900 NaN NaN NaN NaN NaN NaN 5.0 6.0 NaN NaN NaN NaN 39.0 NaN NaN 2.0 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN 2.0 NaN 6.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN 185.0 NaN 78.0 NaN NaN 40.0 NaN NaN NaN NaN NaN 5.0 NaN 2.0 NaN NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20.0 NaN NaN 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 55.0 NaN NaN NaN NaN NaN NaN 34.0
1904 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 35.0 NaN NaN NaN NaN NaN NaN NaN 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN 13.0 NaN 2.0 NaN NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 394.0 NaN NaN NaN NaN NaN NaN 8.0
1908 NaN NaN NaN 19.0 NaN NaN NaN 1.0 NaN NaN NaN NaN 31.0 NaN NaN 5.0 NaN NaN NaN 51.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 29.0 35.0 NaN 347.0 NaN NaN 22.0 NaN 3.0 NaN NaN NaN 18.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 11.0 NaN NaN 44.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 98.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 63.0 NaN NaN NaN NaN NaN NaN NaN
1912 NaN NaN NaN 10.0 NaN NaN NaN 14.0 NaN NaN NaN NaN 19.0 NaN NaN NaN NaN NaN NaN 8.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 84.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 67.0 25.0 NaN 160.0 NaN NaN 52.0 NaN 2.0 NaN NaN NaN 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.0 NaN NaN 76.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 14.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 173.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 101.0 NaN NaN NaN NaN NaN NaN NaN
In [216]:
medal_counts.tail()
Out[216]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1992 NaN NaN 2.0 NaN 2.0 NaN 57.0 6.0 NaN 1.0 NaN NaN 3.0 NaN NaN NaN 14.0 17.0 NaN 44.0 NaN 83.0 NaN NaN 1.0 NaN 15.0 71.0 NaN 14.0 NaN NaN NaN NaN NaN 66.0 3.0 3.0 NaN 223.0 7.0 57.0 NaN 50.0 NaN NaN 198.0 13.0 2.0 NaN NaN NaN 45.0 6.0 NaN 3.0 3.0 2.0 NaN NaN 2.0 NaN 46.0 4.0 47.0 NaN 8.0 NaN 49.0 NaN NaN 3.0 NaN 13.0 NaN 3.0 2.0 NaN 1.0 2.0 NaN NaN NaN 2.0 33.0 11.0 NaN 23.0 15.0 16.0 NaN NaN 1.0 1.0 42.0 NaN 10.0 1.0 1.0 53.0 3.0 NaN NaN NaN NaN NaN 6.0 NaN NaN NaN 1.0 1.0 NaN 35.0 NaN NaN 8.0 NaN 1.0 NaN NaN 20.0 NaN NaN 6.0 NaN NaN NaN NaN NaN 224.0 NaN NaN NaN NaN NaN NaN NaN
1996 NaN NaN 3.0 NaN 20.0 2.0 132.0 3.0 1.0 5.0 NaN 1.0 6.0 NaN 23.0 NaN 64.0 21.0 NaN 51.0 NaN 110.0 NaN NaN NaN 1.0 29.0 57.0 13.0 24.0 NaN NaN 1.0 NaN NaN 67.0 NaN 3.0 NaN NaN 4.0 51.0 NaN 26.0 NaN 2.0 124.0 NaN 8.0 NaN NaN 1.0 43.0 6.0 1.0 NaN 3.0 4.0 NaN NaN 1.0 NaN 71.0 16.0 43.0 11.0 8.0 NaN 66.0 NaN NaN 1.0 NaN 12.0 NaN 2.0 3.0 3.0 1.0 1.0 NaN 1.0 NaN 2.0 73.0 26.0 NaN 25.0 9.0 NaN NaN NaN NaN 1.0 21.0 3.0 5.0 1.0 NaN 38.0 5.0 NaN 115.0 NaN NaN NaN 2.0 NaN NaN NaN 11.0 NaN 3.0 31.0 1.0 NaN NaN 1.0 2.0 NaN NaN 1.0 2.0 1.0 6.0 NaN 1.0 34.0 NaN NaN 260.0 2.0 NaN NaN 26.0 1.0 NaN NaN
2000 NaN NaN 5.0 NaN 20.0 1.0 183.0 4.0 3.0 6.0 1.0 NaN 7.0 NaN 22.0 NaN 48.0 13.0 NaN 31.0 18.0 79.0 NaN 18.0 1.0 2.0 10.0 69.0 9.0 25.0 NaN NaN NaN NaN NaN 43.0 3.0 8.0 NaN NaN 5.0 66.0 NaN 55.0 NaN 6.0 119.0 NaN 18.0 NaN NaN NaN 53.0 8.0 1.0 NaN 4.0 1.0 NaN 1.0 1.0 NaN 65.0 23.0 43.0 7.0 7.0 1.0 73.0 2.0 1.0 3.0 NaN 17.0 NaN 5.0 NaN 2.0 6.0 NaN 1.0 1.0 NaN NaN 79.0 8.0 NaN 44.0 4.0 NaN NaN NaN NaN NaN 24.0 2.0 4.0 NaN 1.0 46.0 5.0 NaN 188.0 NaN NaN NaN 3.0 NaN 1.0 NaN 14.0 NaN 6.0 32.0 NaN NaN NaN NaN 3.0 NaN NaN 5.0 2.0 NaN 5.0 NaN NaN 35.0 NaN 1.0 248.0 4.0 NaN 1.0 26.0 NaN NaN NaN
2004 NaN NaN NaN NaN 47.0 NaN 157.0 8.0 5.0 2.0 NaN NaN 3.0 NaN 17.0 NaN 40.0 17.0 NaN 17.0 4.0 94.0 NaN 1.0 2.0 NaN 20.0 61.0 12.0 29.0 NaN 1.0 NaN 5.0 1.0 27.0 3.0 7.0 NaN NaN 2.0 53.0 NaN 57.0 NaN 4.0 149.0 NaN 31.0 NaN NaN 2.0 40.0 5.0 1.0 NaN 6.0 NaN NaN NaN 2.0 NaN 102.0 13.0 94.0 8.0 7.0 NaN 52.0 NaN NaN 4.0 NaN 3.0 NaN 3.0 NaN NaN 4.0 1.0 NaN NaN NaN NaN 76.0 8.0 NaN 7.0 6.0 NaN NaN 17.0 NaN NaN 12.0 3.0 5.0 NaN NaN 39.0 10.0 NaN 192.0 14.0 NaN NaN 5.0 NaN NaN NaN 7.0 NaN 10.0 12.0 1.0 NaN NaN NaN 8.0 NaN NaN 9.0 1.0 NaN 10.0 1.0 NaN 48.0 NaN NaN 264.0 5.0 2.0 NaN NaN NaN 3.0 NaN
2008 1.0 NaN 2.0 NaN 51.0 6.0 149.0 3.0 7.0 5.0 NaN NaN 5.0 NaN 30.0 NaN 75.0 5.0 NaN 34.0 1.0 184.0 NaN 1.0 2.0 NaN 5.0 47.0 7.0 18.0 NaN 2.0 1.0 1.0 NaN 71.0 3.0 7.0 NaN NaN 5.0 76.0 NaN 77.0 NaN 6.0 101.0 NaN 7.0 NaN NaN NaN 27.0 7.0 3.0 NaN 2.0 3.0 NaN 14.0 1.0 NaN 42.0 17.0 51.0 13.0 14.0 2.0 78.0 NaN NaN 3.0 NaN 5.0 NaN 2.0 1.0 1.0 4.0 4.0 NaN NaN 1.0 NaN 62.0 24.0 NaN 22.0 14.0 NaN 1.0 NaN NaN NaN 20.0 2.0 6.0 NaN NaN 22.0 1.0 NaN 143.0 NaN NaN 3.0 5.0 15.0 NaN 1.0 11.0 NaN 10.0 7.0 NaN NaN NaN NaN 4.0 2.0 1.0 4.0 5.0 1.0 8.0 NaN NaN 31.0 NaN NaN 315.0 6.0 1.0 1.0 NaN NaN 4.0 NaN

Computing fraction of medals per Olympic edition

In this exercise, you'll start with the DataFrames editions, medals, & medal_counts from prior exercises.

You can extract a Series with the total number of medals awarded in each Olympic edition.

The DataFrame medal_counts can be divided row-wise by the total number of medals awarded each edition; the method .divide() performs the broadcast as you require.

This gives you a normalized indication of each country's performance in each edition.

Instructions

  • Set the index of the DataFrame editions to be 'Edition' (using the method .set_index()). Save the result as totals.
  • Extract the 'Grand Total' column from totals and assign the result back to totals.
  • Divide the DataFrame medal_counts by totals along each row. You will have to use the .divide() method with the option axis='rows'. Assign the result to fractions.
  • Print first & last 5 rows of the DataFrame fractions. This has been done for you, so hit 'Submit Answer' to see the results!
In [217]:
# Set Index of editions: totals
totals = editions.set_index('Edition')
totals.head()
Out[217]:
Grand Total City Country
Edition
1896 151 Athens Greece
1900 512 Paris France
1904 470 St. Louis United States
1908 804 London United Kingdom
1912 885 Stockholm Sweden
In [218]:
# Reassign totals['Grand Total']: totals
totals = totals['Grand Total']
totals.head()
Out[218]:
Edition
1896    151
1900    512
1904    470
1908    804
1912    885
Name: Grand Total, dtype: int64
In [219]:
# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals, axis='rows')

# Print first & last 5 rows of fractions
fractions.head()
Out[219]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1896 NaN NaN NaN NaN NaN NaN 0.013245 0.033113 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.039735 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.072848 NaN 0.046358 NaN NaN 0.218543 NaN 0.344371 NaN NaN NaN 0.039735 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.019868 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.132450 NaN NaN NaN NaN NaN NaN 0.039735
1900 NaN NaN NaN NaN NaN NaN 0.009766 0.011719 NaN NaN NaN NaN 0.076172 NaN NaN 0.003906 NaN NaN NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.003906 NaN 0.011719 NaN NaN NaN NaN NaN 0.003906 NaN NaN NaN NaN NaN 0.361328 NaN 0.152344 NaN NaN 0.078125 NaN NaN NaN NaN NaN 0.009766 NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.007812 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.039062 NaN NaN 0.017578 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.029297 NaN NaN 0.001953 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.107422 NaN NaN NaN NaN NaN NaN 0.066406
1904 NaN NaN NaN NaN NaN NaN NaN 0.002128 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.074468 NaN NaN NaN NaN NaN NaN NaN 0.019149 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.004255 NaN NaN 0.027660 NaN 0.004255 NaN NaN NaN 0.008511 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.004255 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.838298 NaN NaN NaN NaN NaN NaN 0.017021
1908 NaN NaN NaN 0.023632 NaN NaN NaN 0.001244 NaN NaN NaN NaN 0.038557 NaN NaN 0.006219 NaN NaN NaN 0.063433 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.018657 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.036070 0.043532 NaN 0.431592 NaN NaN 0.027363 NaN 0.003731 NaN NaN NaN 0.022388 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.008706 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.013682 NaN NaN 0.054726 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.002488 0.003731 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.121891 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.078358 NaN NaN NaN NaN NaN NaN NaN
1912 NaN NaN NaN 0.011299 NaN NaN NaN 0.015819 NaN NaN NaN NaN 0.021469 NaN NaN NaN NaN NaN NaN 0.009040 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.094915 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.075706 0.028249 NaN 0.180791 NaN NaN 0.058757 NaN 0.002260 NaN NaN NaN 0.033898 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.023729 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.024859 NaN NaN 0.085876 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.007910 0.015819 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.195480 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.114124 NaN NaN NaN NaN NaN NaN NaN
In [220]:
fractions.tail()
Out[220]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1992 NaN NaN 0.001173 NaN 0.001173 NaN 0.033431 0.003519 NaN 0.000587 NaN NaN 0.001760 NaN NaN NaN 0.008211 0.009971 NaN 0.025806 NaN 0.048680 NaN NaN 0.000587 NaN 0.008798 0.041642 NaN 0.008211 NaN NaN NaN NaN NaN 0.038710 0.001760 0.001760 NaN 0.130792 0.004106 0.033431 NaN 0.029326 NaN NaN 0.116129 0.007625 0.001173 NaN NaN NaN 0.026393 0.003519 NaN 0.00176 0.001760 0.001173 NaN NaN 0.001173 NaN 0.026979 0.002346 0.027566 NaN 0.004692 NaN 0.028739 NaN NaN 0.001760 NaN 0.007625 NaN 0.001760 0.001173 NaN 0.000587 0.001173 NaN NaN NaN 0.001173 0.019355 0.006452 NaN 0.013490 0.008798 0.009384 NaN NaN 0.000587 0.000587 0.024633 NaN 0.005865 0.000587 0.000587 0.031085 0.001760 NaN NaN NaN NaN NaN 0.003519 NaN NaN NaN 0.000587 0.000587 NaN 0.020528 NaN NaN 0.004692 NaN 0.000587 NaN NaN 0.011730 NaN NaN 0.003519 NaN NaN NaN NaN NaN 0.131378 NaN NaN NaN NaN NaN NaN NaN
1996 NaN NaN 0.001614 NaN 0.010758 0.001076 0.071006 0.001614 0.000538 0.002690 NaN 0.000538 0.003228 NaN 0.012372 NaN 0.034427 0.011296 NaN 0.027434 NaN 0.059172 NaN NaN NaN 0.000538 0.015600 0.030662 0.006993 0.012910 NaN NaN 0.000538 NaN NaN 0.036041 NaN 0.001614 NaN NaN 0.002152 0.027434 NaN 0.013986 NaN 0.001076 0.066703 NaN 0.004303 NaN NaN 0.000538 0.023131 0.003228 0.000538 NaN 0.001614 0.002152 NaN NaN 0.000538 NaN 0.038193 0.008607 0.023131 0.005917 0.004303 NaN 0.035503 NaN NaN 0.000538 NaN 0.006455 NaN 0.001076 0.001614 0.001614 0.000538 0.000538 NaN 0.000538 NaN 0.001076 0.039268 0.013986 NaN 0.013448 0.004841 NaN NaN NaN NaN 0.000538 0.011296 0.001614 0.002690 0.000538 NaN 0.020441 0.002690 NaN 0.061861 NaN NaN NaN 0.001076 NaN NaN NaN 0.005917 NaN 0.001614 0.016676 0.000538 NaN NaN 0.000538 0.001076 NaN NaN 0.000538 0.001076 0.000538 0.003228 NaN 0.000538 0.018289 NaN NaN 0.139860 0.001076 NaN NaN 0.013986 0.000538 NaN NaN
2000 NaN NaN 0.002481 NaN 0.009926 0.000496 0.090819 0.001985 0.001489 0.002978 0.000496 NaN 0.003474 NaN 0.010918 NaN 0.023821 0.006452 NaN 0.015385 0.008933 0.039206 NaN 0.008933 0.000496 0.000993 0.004963 0.034243 0.004467 0.012407 NaN NaN NaN NaN NaN 0.021340 0.001489 0.003970 NaN NaN 0.002481 0.032754 NaN 0.027295 NaN 0.002978 0.059057 NaN 0.008933 NaN NaN NaN 0.026303 0.003970 0.000496 NaN 0.001985 0.000496 NaN 0.000496 0.000496 NaN 0.032258 0.011414 0.021340 0.003474 0.003474 0.000496 0.036228 0.000993 0.000496 0.001489 NaN 0.008437 NaN 0.002481 NaN 0.000993 0.002978 NaN 0.000496 0.000496 NaN NaN 0.039206 0.003970 NaN 0.021836 0.001985 NaN NaN NaN NaN NaN 0.011911 0.000993 0.001985 NaN 0.000496 0.022829 0.002481 NaN 0.093300 NaN NaN NaN 0.001489 NaN 0.000496 NaN 0.006948 NaN 0.002978 0.015881 NaN NaN NaN NaN 0.001489 NaN NaN 0.002481 0.000993 NaN 0.002481 NaN NaN 0.017370 NaN 0.000496 0.123077 0.001985 NaN 0.000496 0.012903 NaN NaN NaN
2004 NaN NaN NaN NaN 0.023524 NaN 0.078579 0.004004 0.002503 0.001001 NaN NaN 0.001502 NaN 0.008509 NaN 0.020020 0.008509 NaN 0.008509 0.002002 0.047047 NaN 0.000501 0.001001 NaN 0.010010 0.030531 0.006006 0.014515 NaN 0.000501 NaN 0.002503 0.000501 0.013514 0.001502 0.003504 NaN NaN 0.001001 0.026527 NaN 0.028529 NaN 0.002002 0.074575 NaN 0.015516 NaN NaN 0.001001 0.020020 0.002503 0.000501 NaN 0.003003 NaN NaN NaN 0.001001 NaN 0.051051 0.006507 0.047047 0.004004 0.003504 NaN 0.026026 NaN NaN 0.002002 NaN 0.001502 NaN 0.001502 NaN NaN 0.002002 0.000501 NaN NaN NaN NaN 0.038038 0.004004 NaN 0.003504 0.003003 NaN NaN 0.008509 NaN NaN 0.006006 0.001502 0.002503 NaN NaN 0.019520 0.005005 NaN 0.096096 0.007007 NaN NaN 0.002503 NaN NaN NaN 0.003504 NaN 0.005005 0.006006 0.000501 NaN NaN NaN 0.004004 NaN NaN 0.004505 0.000501 NaN 0.005005 0.000501 NaN 0.024024 NaN NaN 0.132132 0.002503 0.001001 NaN NaN NaN 0.001502 NaN
2008 0.00049 NaN 0.000979 NaN 0.024976 0.002938 0.072968 0.001469 0.003428 0.002449 NaN NaN 0.002449 NaN 0.014691 NaN 0.036729 0.002449 NaN 0.016650 0.000490 0.090108 NaN 0.000490 0.000979 NaN 0.002449 0.023017 0.003428 0.008815 NaN 0.000979 0.000490 0.000490 NaN 0.034770 0.001469 0.003428 NaN NaN 0.002449 0.037218 NaN 0.037708 NaN 0.002938 0.049461 NaN 0.003428 NaN NaN NaN 0.013222 0.003428 0.001469 NaN 0.000979 0.001469 NaN 0.006856 0.000490 NaN 0.020568 0.008325 0.024976 0.006366 0.006856 0.000979 0.038198 NaN NaN 0.001469 NaN 0.002449 NaN 0.000979 0.000490 0.000490 0.001959 0.001959 NaN NaN 0.00049 NaN 0.030362 0.011753 NaN 0.010774 0.006856 NaN 0.00049 NaN NaN NaN 0.009794 0.000979 0.002938 NaN NaN 0.010774 0.000490 NaN 0.070029 NaN NaN 0.001469 0.002449 0.007346 NaN 0.00049 0.005387 NaN 0.004897 0.003428 NaN NaN NaN NaN 0.001959 0.000979 0.00049 0.001959 0.002449 0.000490 0.003918 NaN NaN 0.015181 NaN NaN 0.154261 0.002938 0.000490 0.000490 NaN NaN 0.001959 NaN

Computing percentage change in fraction of medals won

Here, you'll start with the DataFrames editions, medals, medal_counts, & fractions from prior exercises.

To see if there is a host country advantage, you first want to see how the fraction of medals won changes from edition to edition.

The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time. If you are interested in learning more about pandas' expanding transformations, this section of the pandas documentation has additional information.

Instructions

  • Create mean_fractions by chaining the methods .expanding().mean() to fractions.
  • Compute the percentage change in mean_fractions down each column by applying .pct_change() and multiplying by 100. Assign the result to fractions_change.
  • Reset the index of fractions_change using the .reset_index() method. This will make 'Edition' an ordinary column.
  • Print the first and last 5 rows of the DataFrame fractions_change. This has been done for you, so hit 'Submit Answer' to see the results!
In [221]:
# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()
mean_fractions.head()
Out[221]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1896 NaN NaN NaN NaN NaN NaN 0.013245 0.033113 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.039735 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.072848 NaN 0.046358 NaN NaN 0.218543 NaN 0.344371 NaN NaN NaN 0.039735 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.019868 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.132450 NaN NaN NaN NaN NaN NaN 0.039735
1900 NaN NaN NaN NaN NaN NaN 0.011505 0.022416 NaN NaN NaN NaN 0.076172 NaN NaN 0.003906 NaN NaN NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.003906 NaN 0.025727 NaN NaN NaN NaN NaN 0.003906 NaN NaN NaN NaN NaN 0.217088 NaN 0.099351 NaN NaN 0.148334 NaN 0.344371 NaN NaN NaN 0.024750 NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.007812 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.039062 NaN NaN 0.017578 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.024582 NaN NaN 0.001953 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.119936 NaN NaN NaN NaN NaN NaN 0.053071
1904 NaN NaN NaN NaN NaN NaN 0.011505 0.015653 NaN NaN NaN NaN 0.076172 NaN NaN 0.003906 NaN NaN NaN 0.039187 NaN NaN NaN NaN NaN NaN NaN 0.011528 NaN 0.025727 NaN NaN NaN NaN NaN 0.003906 NaN NaN NaN NaN NaN 0.217088 NaN 0.067652 NaN NaN 0.108109 NaN 0.174313 NaN NaN NaN 0.019337 NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.007812 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.039062 NaN NaN 0.017578 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.017807 NaN NaN 0.001953 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.359390 NaN NaN NaN NaN NaN NaN 0.041054
1908 NaN NaN NaN 0.023632 NaN NaN 0.011505 0.012051 NaN NaN NaN NaN 0.057365 NaN NaN 0.005063 NaN NaN NaN 0.047269 NaN NaN NaN NaN NaN NaN NaN 0.011528 NaN 0.023370 NaN NaN NaN NaN NaN 0.003906 NaN NaN NaN NaN 0.036070 0.159236 NaN 0.158637 NaN NaN 0.087923 NaN 0.117453 NaN NaN NaN 0.020100 NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.008259 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.026372 NaN NaN 0.036152 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.002488 0.003731 NaN NaN NaN NaN NaN NaN NaN NaN 0.017807 NaN NaN 0.061922 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.289132 NaN NaN NaN NaN NaN NaN 0.041054
1912 NaN NaN NaN 0.017466 NaN NaN 0.011505 0.012804 NaN NaN NaN NaN 0.045399 NaN NaN 0.005063 NaN NaN NaN 0.037712 NaN NaN NaN NaN NaN NaN NaN 0.011528 NaN 0.041256 NaN NaN NaN NaN NaN 0.003906 NaN NaN NaN NaN 0.055888 0.126489 NaN 0.163068 NaN NaN 0.082090 NaN 0.088654 NaN NaN NaN 0.022860 NaN 0.003906 NaN NaN NaN NaN NaN NaN NaN 0.013416 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.025868 NaN NaN 0.052727 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.005199 0.009775 NaN NaN NaN NaN NaN NaN NaN NaN 0.017807 NaN NaN 0.106441 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.254131 NaN NaN NaN NaN NaN NaN 0.041054
In [222]:
# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change()*100
fractions_change.head()
Out[222]:
NOC AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
Edition
1896 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1900 NaN NaN NaN NaN NaN NaN -13.134766 -32.304688 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -35.253906 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 198.002486 NaN 114.313616 NaN NaN -32.125947 NaN 0.000000 NaN NaN NaN -37.711589 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.730469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9.448242 NaN NaN NaN NaN NaN NaN 33.561198
1904 NaN NaN NaN NaN NaN NaN 0.000000 -30.169386 NaN NaN NaN NaN 0.000000 NaN NaN 0.00000 NaN NaN NaN 903.191489 NaN NaN NaN NaN NaN NaN NaN 195.106383 NaN 0.000000 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN 0.000000 NaN -31.905623 NaN NaN -27.117728 NaN -49.382160 NaN NaN NaN -21.871362 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -27.563146 NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 199.651245 NaN NaN NaN NaN NaN NaN -22.642384
1908 NaN NaN NaN NaN NaN NaN 0.000000 -23.013510 NaN NaN NaN NaN -24.690649 NaN NaN 29.60199 NaN NaN NaN 20.623816 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN -9.160582 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN -26.649046 NaN 134.489218 NaN NaN -18.672328 NaN -32.619801 NaN NaN NaN 3.944407 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 5.721393 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -32.487562 NaN NaN 105.666114 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 3070.398010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -19.549222 NaN NaN NaN NaN NaN NaN 0.000000
1912 NaN NaN NaN -26.092774 NaN NaN 0.000000 6.254438 NaN NaN NaN NaN -20.858191 NaN NaN 0.00000 NaN NaN NaN -20.219099 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN 76.534540 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 54.944477 -20.564982 NaN 2.793012 NaN NaN -6.634382 NaN -24.518979 NaN NaN NaN 13.729899 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 62.430575 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -1.912744 NaN NaN 45.846353 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 108.983051 161.977401 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 71.896226 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -12.105733 NaN NaN NaN NaN NaN NaN 0.000000
In [223]:
# Reset the index of fractions_change: fractions_change
fractions_change = fractions_change.reset_index()

# Print first & last 5 rows of fractions_change
fractions_change.head()
Out[223]:
NOC Edition AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
0 1896 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1900 NaN NaN NaN NaN NaN NaN -13.134766 -32.304688 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -35.253906 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 198.002486 NaN 114.313616 NaN NaN -32.125947 NaN 0.000000 NaN NaN NaN -37.711589 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.730469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9.448242 NaN NaN NaN NaN NaN NaN 33.561198
2 1904 NaN NaN NaN NaN NaN NaN 0.000000 -30.169386 NaN NaN NaN NaN 0.000000 NaN NaN 0.00000 NaN NaN NaN 903.191489 NaN NaN NaN NaN NaN NaN NaN 195.106383 NaN 0.000000 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN 0.000000 NaN -31.905623 NaN NaN -27.117728 NaN -49.382160 NaN NaN NaN -21.871362 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -27.563146 NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 199.651245 NaN NaN NaN NaN NaN NaN -22.642384
3 1908 NaN NaN NaN NaN NaN NaN 0.000000 -23.013510 NaN NaN NaN NaN -24.690649 NaN NaN 29.60199 NaN NaN NaN 20.623816 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN -9.160582 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN -26.649046 NaN 134.489218 NaN NaN -18.672328 NaN -32.619801 NaN NaN NaN 3.944407 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 5.721393 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -32.487562 NaN NaN 105.666114 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 3070.398010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -19.549222 NaN NaN NaN NaN NaN NaN 0.000000
4 1912 NaN NaN NaN -26.092774 NaN NaN 0.000000 6.254438 NaN NaN NaN NaN -20.858191 NaN NaN 0.00000 NaN NaN NaN -20.219099 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN 76.534540 NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 54.944477 -20.564982 NaN 2.793012 NaN NaN -6.634382 NaN -24.518979 NaN NaN NaN 13.729899 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 62.430575 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -1.912744 NaN NaN 45.846353 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 108.983051 161.977401 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN 71.896226 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -12.105733 NaN NaN NaN NaN NaN NaN 0.000000
In [224]:
fractions_change.tail()
Out[224]:
<
NOC Edition AFG AHO ALG ANZ ARG ARM AUS AUT AZE BAH BAR BDI BEL BER BLR BOH BRA BUL BWI CAN CHI CHN CIV CMR COL CRC CRO CUB CZE DEN DJI DOM ECU EGY ERI ESP EST ETH EUA EUN FIN FRA FRG GBR GDR GEO GER GHA GRE GUY HAI HKG HUN INA IND IOP IRI IRL IRQ ISL ISR ISV ITA JAM JPN KAZ KEN KGZ KOR KSA KUW LAT LIB LTU LUX MAR MAS MDA MEX MGL MKD MOZ MRI NAM NED NGR NIG NOR NZL PAK PAN PAR PER PHI POL POR PRK PUR QAT ROU RSA RU1 RUS SCG SEN SIN SLO SRB SRI SUD SUI SUR SVK SWE SYR TAN TCH TGA THA TJK TOG TPE TRI TUN TUR UAE UGA UKR URS URU USA UZB VEN VIE YUG ZAM ZIM ZZX
21 1992 NaN 0.0 -7.214076 0.0 -6.767308 NaN 2.754114 -3.034840 NaN -24.111659 NaN NaN -4.842805 0.0 NaN 0.0 -0.744112 -5.670409 0.0 0.175522 0.000000 4.240630 0.0 0.000000 -13.615510 0.000000 NaN 31.154706 NaN -2.965992 0.0 0.000000 NaN 0.000000 NaN 32.943248 -15.272403 0.313086 0.0 NaN -4.446724 -2.246345 0.0 -2.731100 0.0 NaN 2.172297 167.686073 -7.400672 0.0 0.0 NaN -2.260967 40.674487 0.000000 NaN -4.863231 -5.779911 0.0 0.000000 NaN 0.0 -2.358676 -2.946081 -0.484390 NaN -3.902657 NaN 14.789187 NaN NaN -3.346081 0.0 NaN 0.0 4.687834 NaN NaN -6.282029 -5.830225 NaN NaN NaN NaN -1.541059 66.973443 0.0 -2.845541 2.032222 -2.482733 0.00000 NaN -20.453817 -11.275721 -0.301451 0.000000 -0.281514 -11.929985 NaN 0.676152 -6.308757 0.0 NaN NaN 0.0 0.000000 NaN NaN 0.000000 NaN -4.856259 -4.662757 NaN -2.992170 0.000000 0.0 -4.855929 NaN -4.038382 NaN NaN 290.427599 0.000000 0.000000 -2.767259 NaN 0.000000 NaN 0.0 0.000000 -1.329330 NaN 0.000000 NaN 0.000000 0.000000 0.000000 0.0
22 1996 NaN 0.0 8.959211 0.0 1.306696 NaN 10.743275 -3.876773 NaN 16.793717 NaN NaN -4.230212 0.0 NaN 0.0 20.111002 -4.363518 0.0 0.468867 0.000000 7.860247 0.0 0.000000 0.000000 -8.418505 38.658777 11.355483 NaN -1.767750 0.0 0.000000 NaN 0.000000 NaN 19.283123 0.000000 -0.978965 0.0 0.0 -4.574592 -2.529696 0.0 -3.496901 0.0 NaN -2.870081 0.000000 -6.069972 0.0 0.0 NaN -2.422847 6.078215 -7.434730 0.0 -4.589029 0.518200 0.0 0.000000 -27.071006 0.0 -1.114834 17.777438 -1.362980 NaN -3.843949 NaN 15.161867 NaN NaN -17.886191 0.0 -7.669549 0.0 -6.128417 18.786982 NaN -5.880390 -9.762539 NaN NaN NaN -4.142012 2.286353 75.528000 0.0 -2.612566 -1.851608 0.000000 0.00000 NaN 0.000000 -9.509095 -3.505201 -4.506051 -10.906079 -9.111125 0.000000 -2.067262 -4.643242 0.0 NaN NaN 0.0 0.000000 -34.714004 NaN 0.000000 NaN -3.310301 0.000000 NaN -3.160616 -10.758472 0.0 0.000000 NaN 12.054737 NaN NaN -17.036096 -11.549636 -16.531330 -2.865545 NaN -15.722487 NaN 0.0 0.000000 -1.010378 NaN 0.000000 NaN -2.667732 -10.758472 0.000000 0.0
23 2000 NaN 0.0 19.762488 0.0 0.515190 -26.935484 12.554986 -3.464221 88.387097 11.693273 NaN 0.0 -3.937948 0.0 -5.876578 0.0 7.987043 -5.745491 0.0 -1.758520 44.530287 -3.851278 0.0 326.433237 -11.078926 22.518245 -19.772404 10.379886 -18.064516 -1.748505 0.0 0.000000 0.000000 0.000000 NaN 5.576306 -12.732220 16.428481 0.0 0.0 -4.272276 -1.986048 0.0 -2.517516 0.0 88.387097 -3.135334 0.000000 -4.453713 0.0 0.0 0.000000 -1.943492 9.276908 -6.904472 0.0 -3.012862 -7.598425 0.0 -15.108035 -13.995943 0.0 -1.571947 19.343853 -1.580262 -20.645161 -4.829604 NaN 11.365816 NaN NaN -0.817660 0.0 6.614083 0.0 11.735730 0.000000 -19.247312 -1.914668 0.000000 NaN -3.870968 NaN 0.000000 2.011575 -3.792355 0.0 -0.614563 -4.152920 0.000000 0.00000 NaN 0.000000 0.000000 -3.059593 -6.171254 -10.388741 0.000000 -7.692308 -1.264420 -4.403586 0.0 25.410940 NaN 0.0 0.000000 -11.732125 NaN -29.801489 NaN -2.862764 0.000000 42.258065 -3.042289 0.000000 0.0 0.000000 0.0 16.322880 NaN NaN -2.933583 -9.321547 0.000000 -3.697468 NaN 0.000000 -2.514231 0.0 -12.025323 -1.341842 42.258065 0.000000 NaN -2.696445 0.000000 0.000000 0.0
24 2004 NaN 0.0 0.000000 0.0 9.625365 0.000000 8.161162 -2.186922 48.982144 -8.717582 0.0 0.0 -4.191578 0.0 -8.978451 0.0 4.441670 -4.349906 0.0 -2.847128 -8.328117 0.128863 0.0 -21.454822 0.196563 0.000000 0.570340 6.083354 1.607119 -1.200641 0.0 -13.488488 0.000000 -4.340613 NaN 0.634407 -10.388427 8.378425 0.0 0.0 -4.348500 -2.296504 0.0 -2.302551 0.0 -0.407142 -1.437304 0.000000 -2.478635 0.0 0.0 43.043043 -2.415933 -4.183048 -6.425639 0.0 -0.032026 0.000000 0.0 0.000000 9.013377 0.0 0.225280 4.116286 3.545940 -4.909245 -3.886027 0.000000 4.037571 0.0 0.0 5.005598 0.0 -19.998650 0.0 -1.101722 0.000000 0.000000 -3.085448 -8.419170 0.0 0.000000 NaN 0.000000 1.603836 -2.718009 0.0 -4.287069