Manipulating DataFrames with pandas

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import NaN
from scipy.stats import zscore
import scipy
import os

Tested with the following package versions

print('Pandas version:', pd.__version__)
print('Numpy version:', np.__version__)
print('Matplotlib version:', plt.matplotlib.__version__)
print('Scipy version:', scipy.__version__)
Pandas version: 2.2.1
Numpy version: 1.26.4
Matplotlib version: 3.8.1
Scipy version: 1.12.0

Set Display Options

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.expand_frame_repr', True)

Data Files Location

Data File Objects

election_penn = 'data/manipulating-dataframes-with-pandas/2012_US_election_results_(Pennsylvania).csv'
gapminder_data = 'data/manipulating-dataframes-with-pandas/gapminder.csv'
medals_data = 'data/manipulating-dataframes-with-pandas/olympic_medals.csv'
weather_data = 'data/manipulating-dataframes-with-pandas/Pittsburgh_weather_data.csv'
sales_data = 'data/manipulating-dataframes-with-pandas/sales.csv'
sales2_data = 'data/manipulating-dataframes-with-pandas/sales2.csv'
sales_feb = 'data/manipulating-dataframes-with-pandas/sales-feb-2015.csv'
titanic_data = 'data/manipulating-dataframes-with-pandas/titanics.csv'
users_data = 'data/manipulating-dataframes-with-pandas/users.csv'
massachusetts_labor = 'data/manipulating-dataframes-with-pandas/LURReport.csv'
auto_mpg = 'data/manipulating-dataframes-with-pandas/auto-mpg.csv'

Manipulating DataFrames with pandas

Course Description

In this course, you’ll learn how to leverage pandas’ extremely powerful data manipulation engine to get the most out of your data. It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. You will learn how to tidy, rearrange, and restructure your data by pivoting or melting and stacking or unstacking DataFrames. These are all fundamental next steps on the road to becoming a well-rounded Data Scientist, and you will have the chance to apply all the concepts you learn to real-world datasets.

What You’ll Learn

  • Extracting, filtering, and transforming data from DataFrames
  • Advanced indexing with multiple levels
  • Tidying, rearranging and restructuring your data
  • Pivoting, melting, and stacking DataFrames
  • Identifying and splitting DataFrames by groups

Extracting and transforming data

In this chapter, you will learn all about how to index, slice, filter, and transform DataFrames, using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

Indexing DataFrames

A simple DataFrame

df = pd.read_csv(sales_data, index_col='month')
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

Indexing using square brackets

df['salt']['Jan']
12.0

Using column attribute and row label

df.eggs['Mar']
221

Accessors

  • A more efficient and more programmatically reusable method of accessing data in a DataFrame is by using accessors
    • .loc - accesses using labels
    • .iloc - accesses using index positions
  • Both accessors take a row specifier and a column specifier inside square brackets, separated by a comma: df.loc[row, col] / df.iloc[row, col]

Using the .loc accessor

df.loc['May', 'spam']
52

Using the .iloc accessor

df.iloc[4, 2]
52

Selecting only some columns

  • When using bracket-indexing without the .loc or .iloc accessors, the result returned can be an individual value, Pandas Series, or Pandas DataFrame.
  • To ensure the return value is a DataFrame, use a nested list within square brackets
df_new = df[['salt','eggs']]
df_new
       salt  eggs
month            
Jan    12.0    47
Feb    50.0   110
Mar    89.0   221
Apr    87.0    77
May     NaN   132
Jun    60.0   205

Exercises

Index ordering

In this exercise, the DataFrame election is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select ‘Bedford’ county and the ‘winner’ column. Which method is the preferred way?

election = pd.read_csv(election_penn, index_col='county')
election.head()
           state   total      Obama     Romney  winner  voters    turnout     margin
county                                                                              
Adams         PA   41973  35.482334  63.112001  Romney   61156  68.632677  27.629667
Allegheny     PA  614671  56.640219  42.185820   Obama  924351  66.497575  14.454399
Armstrong     PA   28322  30.696985  67.901278  Romney   42147  67.198140  37.204293
Beaver        PA   80015  46.032619  52.637630  Romney  115157  69.483401   6.605012
Bedford       PA   21444  22.057452  76.986570  Romney   32189  66.619031  54.929118
election.loc['Bedford', 'winner']
'Romney'

Positional and labeled indexing

Given a pair of label-based indices, sometimes it’s necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. The DataFrame is provided for you as election.

Find x and y such that election.iloc[x, y] == election.loc[‘Bedford’, ‘winner’]. That is, what is the row position of ‘Bedford’, and the column position of ‘winner’? Remember that the first position in Python is 0, not 1!

To answer this question, first explore the DataFrame using election.head() in the IPython Shell and inspect it with your eyes.

Instructions

  • Explore the DataFrame in the IPython Shell using election.head().
  • Assign the row position of election.loc[‘Bedford’] to x.
  • Assign the column position of election[‘winner’] to y.
  • Hit ‘Submit Answer’ to print the boolean equivalence of the .loc and .iloc selections.
# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])
True

Depending on the situation, you may wish to use .iloc[] over .loc[], and vice versa. The important thing to realize is you can achieve the exact same results using either approach.

Indexing and column rearrangement

There are circumstances in which it’s useful to modify the order of your DataFrame columns. We do that now by extracting just two columns from the Pennsylvania election results DataFrame.

Your job is to read the CSV file and set the index to ‘county’. You’ll then assign a new DataFrame by selecting the list of columns [‘winner’, ‘total’, ‘voters’]. The CSV file is provided to you in the variable filename.

Instructions

  • Import pandas as pd.
  • Read in filename using pd.read_csv() and set the index to ‘county’ by specifying the index_col parameter.
  • Create a separate DataFrame results with the columns [‘winner’, ‘total’, ‘voters’].
  • Print the output using results.head(). This has been done for you, so hit ‘Submit Answer’ to see the new DataFrame!
# Read in filename and set the index: election
election = pd.read_csv(election_penn, index_col='county')

# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]

# Print the output of results.head()
print(results.head())
           winner   total  voters
county                           
Adams      Romney   41973   61156
Allegheny   Obama  614671  924351
Armstrong  Romney   28322   42147
Beaver     Romney   80015  115157
Bedford    Romney   21444   32189

The original election DataFrame had 6 columns, but as you can see, your results DataFrame now has just the 3 columns: ‘winner’, ‘total’, and ‘voters’.

Slicing DataFrames

sales DataFrame

df = pd.read_csv(sales_data, index_col='month')
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

Selecting a column (i.e., Series)

df['eggs']
month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64
type(df.eggs)
pandas.core.series.Series

Slicing and indexing a Series

df['eggs'].iloc[1:4] # Part of the eggs column
month
Feb    110
Mar    221
Apr     77
Name: eggs, dtype: int64
df['eggs'].iloc[4] # The value associated with May
132

Using .loc[] (1)

df.loc[:, 'eggs':'salt'] # All rows, some columns
       eggs  salt
month            
Jan      47  12.0
Feb     110  50.0
Mar     221  89.0
Apr      77  87.0
May     132   NaN
Jun     205  60.0

Using .loc[] (2)

df.loc['Jan':'Apr',:] # Some rows, all columns
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20

Using .loc[] (3)

df.loc['Mar':'May', 'salt':'spam']
       salt  spam
month            
Mar    89.0    72
Apr    87.0    20
May     NaN    52

Using .iloc[]

df.iloc[2:5, 1:] # A block from middle of the DataFrame
       salt  spam
month            
Mar    89.0    72
Apr    87.0    20
May     NaN    52

Using lists rather than slices (1)

df.loc['Jan':'May', ['eggs', 'spam']]
       eggs  spam
month            
Jan      47    17
Feb     110    31
Mar     221    72
Apr      77    20
May     132    52

Using lists rather than slices (2)

df.iloc[[0,4,5], 0:2]
       eggs  salt
month            
Jan      47  12.0
May     132   NaN
Jun     205  60.0

Series versus 1-column DataFrame

# A Series by column name
df['eggs']
month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64
type(df['eggs'])
pandas.core.series.Series
# A DataFrame w/ single column
df[['eggs']]
       eggs
month      
Jan      47
Feb     110
Mar     221
Apr      77
May     132
Jun     205
type(df[['eggs']])
pandas.core.frame.DataFrame

Exercises

Slicing rows

The Pennsylvania US election results data set that you have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this exercise, you’re going to perform slicing on the county names of the election DataFrame from the previous exercises, which has been pre-loaded for you.

Instructions

  • Slice the row labels ‘Perry’ to ‘Potter’ and assign the output to p_counties.
  • Print the p_counties DataFrame. This has been done for you.
  • Slice the row labels ‘Potter’ to ‘Perry’ in reverse order. To do this for hypothetical row labels ‘a’ and ‘b’, you could use a stepsize of -1 like so: df.loc[‘b’:’a’:-1].
  • Print the p_counties_rev DataFrame. This has also been done for you, so hit ‘Submit Answer’ to see the result of your slicing!
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']

# Print the p_counties DataFrame
display(p_counties)

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
p_counties_rev
              state   total      Obama     Romney  winner   voters    turnout     margin
county                                                                                  
Perry            PA   18240  29.769737  68.591009  Romney    27245  66.948064  38.821272
Philadelphia     PA  653598  85.224251  14.051451   Obama  1099197  59.461407  71.172800
Pike             PA   23164  43.904334  54.882576  Romney    41840  55.363289  10.978242
Potter           PA    7205  26.259542  72.158223  Romney    10913  66.022175  45.898681

              state   total      Obama     Romney  winner   voters    turnout     margin
county                                                                                  
Potter           PA    7205  26.259542  72.158223  Romney    10913  66.022175  45.898681
Pike             PA   23164  43.904334  54.882576  Romney    41840  55.363289  10.978242
Philadelphia     PA  653598  85.224251  14.051451   Obama  1099197  59.461407  71.172800
Perry            PA   18240  29.769737  68.591009  Romney    27245  66.948064  38.821272

Slicing columns

Similar to row slicing, columns can be sliced by value. In this exercise, your job is to slice column names from the Pennsylvania election results DataFrame using .loc[].

It has been pre-loaded for you as election, with the index set to ‘county’.

Instructions

  • Slice the columns from the starting column to ‘Obama’ and assign the result to left_columns
  • Slice the columns from ‘Obama’ to ‘winner’ and assign the result to middle_columns
  • Slice the columns from ‘Romney’ to the end and assign the result to right_columns
  • The code to print the first 5 rows of left_columns, middle_columns, and right_columns has been written, so hit ‘Submit Answer’ to see the results!
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']

# Print the output of left_columns.head()
print('Left Columns: \n', left_columns.head())

# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']

# Print the output of middle_columns.head()
print('\nMiddle Columns: \n', middle_columns.head())

# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, 'Romney':]

# Print the output of right_columns.head()
print('\nRight Columns: \n', right_columns.head())
Left Columns: 
           state   total      Obama
county                            
Adams        PA   41973  35.482334
Allegheny    PA  614671  56.640219
Armstrong    PA   28322  30.696985
Beaver       PA   80015  46.032619
Bedford      PA   21444  22.057452

Middle Columns: 
                Obama     Romney  winner
county                                 
Adams      35.482334  63.112001  Romney
Allegheny  56.640219  42.185820   Obama
Armstrong  30.696985  67.901278  Romney
Beaver     46.032619  52.637630  Romney
Bedford    22.057452  76.986570  Romney

Right Columns: 
               Romney  winner  voters    turnout     margin
county                                                    
Adams      63.112001  Romney   61156  68.632677  27.629667
Allegheny  42.185820   Obama  924351  66.497575  14.454399
Armstrong  67.901278  Romney   42147  67.198140  37.204293
Beaver     52.637630  Romney  115157  69.483401   6.605012
Bedford    76.986570  Romney   32189  66.619031  54.929118

Subselecting DataFrames with lists

You can use lists to select specific row and column labels with the .loc[] accessor. In this exercise, your job is to select the counties [‘Philadelphia’, ‘Centre’, ‘Fulton’] and the columns [‘winner’,’Obama’,’Romney’] from the election DataFrame, which has been pre-loaded for you with the index set to ‘county’.

Instructions

  • Create the list of row labels [‘Philadelphia’, ‘Centre’, ‘Fulton’] and assign it to rows.
  • Create the list of column labels [‘winner’, ‘Obama’, ‘Romney’] and assign it to cols.
  • Create a new DataFrame by selecting with rows and cols in .loc[] and assign it to three_counties.
  • Print the three_counties DataFrame. This has been done for you, so hit ‘Submit Answer’ to see your new DataFrame.
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']

# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']

# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]

# Print the three_counties DataFrame
print(three_counties)
              winner      Obama     Romney
county                                    
Philadelphia   Obama  85.224251  14.051451
Centre        Romney  48.948416  48.977486
Fulton        Romney  21.096291  77.748861

If you know exactly which rows and columns are of interest to you, this is a useful approach for subselecting DataFrames.

Filtering DataFrames

Data

df = pd.read_csv(sales_data, index_col='month')
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

Creating a Boolean Series

df.salt > 60
month
Jan    False
Feb    False
Mar     True
Apr     True
May    False
Jun    False
Name: salt, dtype: bool

Filtering with a Boolean Series

df[df.salt > 60]
       eggs  salt  spam
month                  
Mar     221  89.0    72
Apr      77  87.0    20
enough_salt_sold = df.salt > 60
df[enough_salt_sold]
       eggs  salt  spam
month                  
Mar     221  89.0    72
Apr      77  87.0    20

Combining filters

df[(df.salt >= 50) & (df.eggs < 200)] # Both conditions
       eggs  salt  spam
month                  
Feb     110  50.0    31
Apr      77  87.0    20
df[(df.salt >= 50) | (df.eggs < 200)] # Either condition
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

DataFrames with zeros and NaNs

df2 = df.copy()
df2['bacon'] = [0, 0, 50, 60, 70, 80]
df2
       eggs  salt  spam  bacon
month                         
Jan      47  12.0    17      0
Feb     110  50.0    31      0
Mar     221  89.0    72     50
Apr      77  87.0    20     60
May     132   NaN    52     70
Jun     205  60.0    55     80

Select columns with all nonzeros

df2.loc[:, df2.all()]
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

Select columns with any nonzeros

df2.loc[:, df2.any()]
       eggs  salt  spam  bacon
month                         
Jan      47  12.0    17      0
Feb     110  50.0    31      0
Mar     221  89.0    72     50
Apr      77  87.0    20     60
May     132   NaN    52     70
Jun     205  60.0    55     80

Select columns with any NaNs

df.loc[:, df.isnull().any()]
       salt
month      
Jan    12.0
Feb    50.0
Mar    89.0
Apr    87.0
May     NaN
Jun    60.0

Select columns without NaNs

df.loc[:, df.notnull().all()]
       eggs  spam
month            
Jan      47    17
Feb     110    31
Mar     221    72
Apr      77    20
May     132    52
Jun     205    55

Drop rows with any NaNs

df.dropna(how='any')
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
Jun     205  60.0    55

Filtering a column based on another

df.eggs[df.salt > 55]
month
Mar    221
Apr     77
Jun    205
Name: eggs, dtype: int64

Modifying a column based on another

df.loc[df.salt > 55, 'eggs'] += 5
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     226  89.0    72
Apr      82  87.0    20
May     132   NaN    52
Jun     210  60.0    55

Exercises

Thresholding data

In this exercise, we have provided the Pennsylvania election results and included a column called ‘turnout’ that contains the percentage of voter turnout per county. Your job is to prepare a boolean array to select all of the rows and columns where voter turnout exceeded 70%.

As before, the DataFrame is available to you as election with the index set to ‘county’.

Instructions

  • Create a boolean array of the condition where the ‘turnout’ column is greater than 70 and assign it to high_turnout.
  • Filter the election DataFrame with the high_turnout array and assign it to high_turnout_df.
  • Print the filtered DataFrame. This has been done for you, so hit ‘Submit Answer’ to see it!
election = pd.read_csv(election_penn, index_col='county')
# Create the boolean array: high_turnout
high_turnout = election.turnout > 70

# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]

# Print the high_turnout_df DataFrame
high_turnout_df
              state   total      Obama     Romney  winner  voters    turnout     margin
county                                                                                 
Bucks            PA  319407  49.966970  48.801686   Obama  435606  73.324748   1.165284
Butler           PA   88924  31.920516  66.816607  Romney  122762  72.436096  34.896091
Chester          PA  248295  49.228539  49.650617  Romney  337822  73.498766   0.422079
Forest           PA    2308  38.734835  59.835355  Romney    3232  71.410891  21.100520
Franklin         PA   62802  30.110506  68.583803  Romney   87406  71.850903  38.473297
Montgomery       PA  401787  56.637223  42.286834   Obama  551105  72.905708  14.350390
Westmoreland     PA  168709  37.567646  61.306154  Romney  238006  70.884347  23.738508

Filtering columns using other columns

The election results DataFrame has a column labeled ‘margin’ which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than 1%, the results would be too-close-to-call.

Your job is to use boolean selection to filter the rows where the margin was less than 1. You’ll then convert these rows of the ‘winner’ column to np.nan to indicate that these results are too close to declare a winner.

The DataFrame has been pre-loaded for you as election.

Instructions

  • Import numpy as np.
  • Create a boolean array for the condition where the ‘margin’ column is less than 1 and assign it to too_close.
  • Convert the entries in the ‘winner’ column where the result was too close to call to np.nan.
  • Print the output of election.info(). This has been done for you, so hit ‘Submit Answer’ to see the results.
# Create the boolean array: too_close
too_close = election.margin < 1

# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = NaN

# Print the output of election.info()
election.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67 entries, Adams to York
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   state    67 non-null     object 
 1   total    67 non-null     int64  
 2   Obama    67 non-null     float64
 3   Romney   67 non-null     float64
 4   winner   64 non-null     object 
 5   voters   67 non-null     int64  
 6   turnout  67 non-null     float64
 7   margin   67 non-null     float64
dtypes: float64(4), int64(2), object(2)
memory usage: 4.7+ KB

Filtering using NaNs

In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The .dropna() method is used to perform this action. You’ll now practice using this method on a dataset obtained from Vanderbilt University, which consists of data from passengers on the Titanic.

The DataFrame has been pre-loaded for you as titanic. Explore it in the IPython Shell and you will note that there are many NaNs. You will focus specifically on the ‘age’ and ‘cabin’ columns in this exercise. Your job is to use .dropna() to remove rows where either of these two columns contains missing data, and then rows where both of these two columns contain missing data.

You’ll also use the .shape attribute, which returns the number of rows and columns in a tuple from a DataFrame, or the number of rows from a Series, to see the effect of dropping missing values from a DataFrame.

Finally, you’ll use the thresh= keyword argument to drop columns from the full dataset that have less than 1000 non-missing values.

Instructions

  • Select the ‘age’ and ‘cabin’ columns of titanic and create a new DataFrame df.
  • Print the shape of df. This has been done for you.
  • Drop rows in df with how=’any’ and print the shape.
  • Drop rows in df with how=’all’ and print the shape.
  • Drop columns from the titanic DataFrame that have less than 1000 non-missing values by specifying the thresh and axis keyword arguments. Print the output of .info() from this.
titanic = pd.read_csv(titanic_data)
# Select the 'age' and 'cabin' columns: df
df = titanic[['age', 'cabin']]

# Print the shape of df
print(df.shape)

# Drop rows in df with how='any' and print the shape
print('\n', df.dropna(how='any').shape)

# Drop rows in df with how='all' and print the shape
print('\n', df.dropna(how='all').shape)

# Drop columns in titanic with less than 1000 non-missing values
print('\n', titanic.dropna(thresh=1000, axis='columns').info())
(1309, 2)

 (272, 2)

 (1069, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   name      1309 non-null   object 
 3   sex       1309 non-null   object 
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64  
 6   parch     1309 non-null   int64  
 7   ticket    1309 non-null   object 
 8   fare      1308 non-null   float64
 9   embarked  1307 non-null   object 
dtypes: float64(2), int64(4), object(4)
memory usage: 102.4+ KB

 None

Transforming DataFrames

Data

df = pd.read_csv(sales_data, index_col='month')
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

DataFrame vectorized methods

df.floordiv(12) # Convert to dozens unit
       eggs  salt  spam
month                  
Jan       3   1.0     1
Feb       9   4.0     2
Mar      18   7.0     6
Apr       6   7.0     1
May      11   NaN     4
Jun      17   5.0     4

NumPy vectorized functions

np.floor_divide(df, 12) # Convert to dozens unit
       eggs  salt  spam
month                  
Jan       3   1.0     1
Feb       9   4.0     2
Mar      18   7.0     6
Apr       6   7.0     1
May      11   NaN     4
Jun      17   5.0     4

Plain Python functions (1)

def dozens(n):
    return n//12

df.apply(dozens)  # Convert to dozens unit
       eggs  salt  spam
month                  
Jan       3   1.0     1
Feb       9   4.0     2
Mar      18   7.0     6
Apr       6   7.0     1
May      11   NaN     4
Jun      17   5.0     4

Plain Python functions (2)

df.apply(lambda n: n//12)
       eggs  salt  spam
month                  
Jan       3   1.0     1
Feb       9   4.0     2
Mar      18   7.0     6
Apr       6   7.0     1
May      11   NaN     4
Jun      17   5.0     4

Storing a transformation

df['dozens_of_eggs'] = df.eggs.floordiv(12)
df
       eggs  salt  spam  dozens_of_eggs
month                                  
Jan      47  12.0    17               3
Feb     110  50.0    31               9
Mar     221  89.0    72              18
Apr      77  87.0    20               6
May     132   NaN    52              11
Jun     205  60.0    55              17

The DataFrame index

df.index
Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], dtype='object', name='month')

Working with string values (1)

df.index = df.index.str.upper()
df
       eggs  salt  spam  dozens_of_eggs
month                                  
JAN      47  12.0    17               3
FEB     110  50.0    31               9
MAR     221  89.0    72              18
APR      77  87.0    20               6
MAY     132   NaN    52              11
JUN     205  60.0    55              17

Working with string values (2)

df.index = df.index.map(str.lower)
df
       eggs  salt  spam  dozens_of_eggs
month                                  
jan      47  12.0    17               3
feb     110  50.0    31               9
mar     221  89.0    72              18
apr      77  87.0    20               6
may     132   NaN    52              11
jun     205  60.0    55              17

Defining columns using other columns

df['salty_eggs'] = df.salt + df.dozens_of_eggs
df
       eggs  salt  spam  dozens_of_eggs  salty_eggs
month                                              
jan      47  12.0    17               3        15.0
feb     110  50.0    31               9        59.0
mar     221  89.0    72              18       107.0
apr      77  87.0    20               6        93.0
may     132   NaN    52              11         NaN
jun     205  60.0    55              17        77.0

Exercises

Using apply() to transform a column

The .apply() method can be used on a pandas DataFrame to apply an arbitrary Python function to each column (or row). In this exercise you’ll work with daily weather data for Pittsburgh in 2013, obtained from Weather Underground.

A function to convert degree Fahrenheit to degree Celsius has been written for you. Your job is to use the .apply() method to perform this conversion on the ‘Mean TemperatureF’ and ‘Mean Dew PointF’ columns of the weather DataFrame.

Instructions

  • Apply the to_celsius() function over the [‘Mean TemperatureF’,’Mean Dew PointF’] columns of the weather DataFrame.
  • Reassign the columns of df_celsius to [‘Mean TemperatureC’,’Mean Dew PointC’].
  • Hit ‘Submit Answer’ to see the new DataFrame with the converted units.
weather = pd.read_csv(weather_data)
# Write a function to convert degree Fahrenheit to degree Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF', 'Mean Dew PointF']].apply(to_celsius)
df_celsius.head()
   Mean TemperatureF  Mean Dew PointF
0          -2.222222        -2.777778
1          -6.111111       -11.111111
2          -4.444444        -9.444444
3          -2.222222        -7.222222
4          -1.111111        -6.666667
# Reassign the columns of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
df_celsius.head()
   Mean TemperatureC  Mean Dew PointC
0          -2.222222        -2.777778
1          -6.111111       -11.111111
2          -4.444444        -9.444444
3          -2.222222        -7.222222
4          -1.111111        -6.666667

Using .map() with a dictionary

The .map() method is used to transform values according to a Python dictionary look-up. In this exercise you’ll practice this method while returning to working with the election DataFrame, which has been pre-loaded for you.

Your job is to use a dictionary to map the values ‘Obama’ and ‘Romney’ in the ‘winner’ column to the values ‘blue’ and ‘red’, and assign the output to the new column ‘color’.

Instructions

  • Create a dictionary with the key:value pairs ‘Obama’:’blue’ and ‘Romney’:’red’.
  • Use the .map() method on the ‘winner’ column using the red_vs_blue dictionary you created.
  • Print the output of election.head(). This has been done for you, so hit ‘Submit Answer’ to see the new column!
election = pd.read_csv(election_penn, index_col='county')
election.head()
           state   total      Obama     Romney  winner  voters    turnout     margin
county                                                                              
Adams         PA   41973  35.482334  63.112001  Romney   61156  68.632677  27.629667
Allegheny     PA  614671  56.640219  42.185820   Obama  924351  66.497575  14.454399
Armstrong     PA   28322  30.696985  67.901278  Romney   42147  67.198140  37.204293
Beaver        PA   80015  46.032619  52.637630  Romney  115157  69.483401   6.605012
Bedford       PA   21444  22.057452  76.986570  Romney   32189  66.619031  54.929118
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)

# Print the output of election.head()
election.head()
           state   total      Obama     Romney  winner  voters    turnout     margin color
county                                                                                    
Adams         PA   41973  35.482334  63.112001  Romney   61156  68.632677  27.629667   red
Allegheny     PA  614671  56.640219  42.185820   Obama  924351  66.497575  14.454399  blue
Armstrong     PA   28322  30.696985  67.901278  Romney   42147  67.198140  37.204293   red
Beaver        PA   80015  46.032619  52.637630  Romney  115157  69.483401   6.605012   red
Bedford       PA   21444  22.057452  76.986570  Romney   32189  66.619031  54.929118   red

Using vectorized functions

When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

You can even write your own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas.
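
To see the difference concretely, here is a minimal sketch (self-contained, with a small stand-in DataFrame rather than the course data) that computes the same values twice: once with .apply() and a Python lambda, and once with the equivalent NumPy UFunc operating on the whole column at once.

# Minimal sketch: .apply() vs. a vectorized UFunc (stand-in data, not the course dataset)
import numpy as np
import pandas as pd

demo = pd.DataFrame({'eggs': [47, 110, 221]}, index=['Jan', 'Feb', 'Mar'])

# Element-wise: a Python-level loop runs under the hood
slow = demo['eggs'].apply(lambda n: np.sqrt(n))

# Vectorized: the UFunc processes the whole array in compiled code
fast = np.sqrt(demo['eggs'])

print(slow.equals(fast))  # True -- same result, different execution paths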

In this exercise you’re going to import the zscore function from scipy.stats and use it to compute the deviation in voter turnout in Pennsylvania from the mean in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean.

Instead of using .apply() as you did in the earlier exercises, the zscore UFunc will take a pandas Series as input and return a NumPy array. You will then assign the values of the NumPy array to a new column in the DataFrame. You will be working with the election DataFrame - it has been pre-loaded for you.

Instructions

  • Import zscore from scipy.stats.
  • Call zscore with election[‘turnout’] as input .
  • Print the output of type(turnout_zscore). This has been done for you.
  • Assign turnout_zscore to a new column in election as ‘turnout_zscore’.
  • Print the output of election.head(). This has been done for you, so hit ‘Submit Answer’ to view the result.
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)

# Print the type of turnout_zscore
print('Type: \n', type(turnout_zscore), '\n')

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

# Print the output of election.head()
election.head()
Type: 
 <class 'pandas.core.series.Series'> 

           state   total      Obama     Romney  winner  voters    turnout     margin color  turnout_zscore
county                                                                                                    
Adams         PA   41973  35.482334  63.112001  Romney   61156  68.632677  27.629667   red        0.853734
Allegheny     PA  614671  56.640219  42.185820   Obama  924351  66.497575  14.454399  blue        0.439846
Armstrong     PA   28322  30.696985  67.901278  Romney   42147  67.198140  37.204293   red        0.575650
Beaver        PA   80015  46.032619  52.637630  Romney  115157  69.483401   6.605012   red        1.018647
Bedford       PA   21444  22.057452  76.986570  Romney   32189  66.619031  54.929118   red        0.463391

Advanced Indexing

Having learned the fundamentals of working with DataFrames, you will now move on to more advanced indexing techniques. You will learn about MultiIndexes, or hierarchical indexes, and learn how to interact with and extract data from them.

Index objects and labeled data

pandas Data Structures

  • Key building blocks
    • Indexes: Sequence of labels
    • Series: 1D array with index
    • DataFrames: 2D array with Series as columns
  • Indexes
    • Immutable (Like dictionary keys)
    • Homogeneous in data type (Like NumPy arrays)

Creating a Series

prices = [10.70, 10.86, 10.74, 10.71, 10.79]
shares = pd.Series(prices)
shares
0    10.70
1    10.86
2    10.74
3    10.71
4    10.79
dtype: float64

Creating an index

days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
shares = pd.Series(prices, index=days)
shares
Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64

Examining an index

print(shares.index)
print(shares.index[2])
print(shares.index[:2])
print(shares.index[-2:])
print(shares.index.name)
Index(['Mon', 'Tue', 'Wed', 'Thur', 'Fri'], dtype='object')
Wed
Index(['Mon', 'Tue'], dtype='object')
Index(['Thur', 'Fri'], dtype='object')
None

Modifying index name

shares.index.name = 'weekday'
shares
weekday
Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64

Modifying index entries

try:
    shares.index[2] = 'Wednesday'
except TypeError:
    print('TypeError: Index does not support mutable operations')
TypeError: Index does not support mutable operations
try:
    shares.index[:4]= ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
except TypeError:
    print('TypeError: Index does not support mutable operations')
TypeError: Index does not support mutable operations

Modifying all index entries

shares.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
shares
Monday       10.70
Tuesday      10.86
Wednesday    10.74
Thursday     10.71
Friday       10.79
dtype: float64

Unemployment data - Massachusetts

# Read the csv file
unemployment = pd.read_csv('data/manipulating-dataframes-with-pandas/LURReport.csv')

# Combine 'Year' and 'Month' columns and convert to datetime
unemployment['Year_Month'] = pd.to_datetime(unemployment['Year'].astype(str) + '-' + unemployment['Month'].astype(str), format='%Y-%B')

# Drop the 'Year' and 'Month' columns
unemployment.drop(['Year', 'Month'], axis=1, inplace=True)

# Show the first 5 rows
unemployment.head()
                Area Labor Force Employed Unemployed  Area Rate  Massachusetts Rate Year_Month
0  Barnstable County     112,449  107,669      4,780        4.3                 2.7 2018-12-01
1  Barnstable County     112,150  108,171      3,979        3.5                 2.6 2018-11-01
2  Barnstable County     115,665  112,127      3,538        3.1                 2.9 2018-10-01
3  Barnstable County     119,420  115,757      3,663        3.1                 3.2 2018-09-01
4  Barnstable County     129,627  125,636      3,991        3.1                 3.5 2018-08-01
unemployment.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Area                 182 non-null    object        
 1   Labor Force          182 non-null    object        
 2   Employed             182 non-null    object        
 3   Unemployed           182 non-null    object        
 4   Area Rate            182 non-null    float64       
 5   Massachusetts Rate   182 non-null    float64       
 6   Year_Month           182 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 10.1+ KB

Assigning the index

unemployment.index = unemployment.Area
unemployment.head()
                                Area Labor Force Employed Unemployed  Area Rate  Massachusetts Rate Year_Month
Area                                                                                                          
Barnstable County  Barnstable County     112,449  107,669      4,780        4.3                 2.7 2018-12-01
Barnstable County  Barnstable County     112,150  108,171      3,979        3.5                 2.6 2018-11-01
Barnstable County  Barnstable County     115,665  112,127      3,538        3.1                 2.9 2018-10-01
Barnstable County  Barnstable County     119,420  115,757      3,663        3.1                 3.2 2018-09-01
Barnstable County  Barnstable County     129,627  125,636      3,991        3.1                 3.5 2018-08-01

Removing the extra column

del unemployment['Area']
unemployment.head()
                  Labor Force Employed Unemployed  Area Rate  Massachusetts Rate Year_Month
Area                                                                                       
Barnstable County     112,449  107,669      4,780        4.3                 2.7 2018-12-01
Barnstable County     112,150  108,171      3,979        3.5                 2.6 2018-11-01
Barnstable County     115,665  112,127      3,538        3.1                 2.9 2018-10-01
Barnstable County     119,420  115,757      3,663        3.1                 3.2 2018-09-01
Barnstable County     129,627  125,636      3,991        3.1                 3.5 2018-08-01

Examining index & columns

print('Index: \n', unemployment.index)
print('\nIndex Name:\n', unemployment.index.name)
print('\nIndex Type:\n', type(unemployment.index))
print('\nDataFrame Columns\n', unemployment.columns)
Index: 
 Index(['Barnstable County', 'Barnstable County', 'Barnstable County',
       'Barnstable County', 'Barnstable County', 'Barnstable County',
       'Barnstable County', 'Barnstable County', 'Barnstable County',
       'Barnstable County',
       ...
       'Worcester County', 'Worcester County', 'Worcester County',
       'Worcester County', 'Worcester County', 'Worcester County',
       'Worcester County', 'Worcester County', 'Worcester County',
       'Worcester County'],
      dtype='object', name='Area', length=182)

Index Name:
 Area

Index Type:
 <class 'pandas.core.indexes.base.Index'>

DataFrame Columns
 Index(['Labor Force', 'Employed', 'Unemployed', 'Area Rate ',
       'Massachusetts Rate ', 'Year_Month'],
      dtype='object')

read_csv() with index_col

# Read the csv file
unemployment = pd.read_csv('data/manipulating-dataframes-with-pandas/LURReport.csv', index_col='Area')

# Combine 'Year' and 'Month' columns and convert to datetime
unemployment['Year_Month'] = pd.to_datetime(unemployment['Year'].astype(str) + '-' + unemployment['Month'].astype(str), format='%Y-%B')

# Drop the 'Year' and 'Month' columns
unemployment.drop(['Year', 'Month'], axis=1, inplace=True)

# Show the first 5 rows
unemployment.head()
                  Labor Force Employed Unemployed  Area Rate  Massachusetts Rate Year_Month
Area                                                                                       
Barnstable County     112,449  107,669      4,780        4.3                 2.7 2018-12-01
Barnstable County     112,150  108,171      3,979        3.5                 2.6 2018-11-01
Barnstable County     115,665  112,127      3,538        3.1                 2.9 2018-10-01
Barnstable County     119,420  115,757      3,663        3.1                 3.2 2018-09-01
Barnstable County     129,627  125,636      3,991        3.1                 3.5 2018-08-01

Exercises

Data

df = pd.read_csv(sales_data, index_col='month')
df
       eggs  salt  spam
month                  
Jan      47  12.0    17
Feb     110  50.0    31
Mar     221  89.0    72
Apr      77  87.0    20
May     132   NaN    52
Jun     205  60.0    55

Index values and names

Which one of the following index operations does not raise an error?

The sales DataFrame which you have seen in the videos of the previous chapter has been pre-loaded for you and is available for exploration in the IPython Shell.

Instructions

Possible Answers

  • sales.index[0] = ‘JAN’.
  • sales.index[0] = sales.index[0].upper().
  • sales.index = range(len(sales)).
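
For reference, a minimal sketch (using a small stand-in DataFrame, not the pre-loaded sales data) shows why only the last option succeeds: individual Index entries cannot be assigned, but the whole index can be replaced.

# Stand-in DataFrame to test the three options
sales = pd.DataFrame({'eggs': [47, 110, 221]}, index=['Jan', 'Feb', 'Mar'])

try:
    sales.index[0] = 'JAN'                    # option 1: mutate a single entry
except TypeError as e:
    print('Option 1 raises TypeError:', e)

try:
    sales.index[0] = sales.index[0].upper()   # option 2: same mutation, different value
except TypeError as e:
    print('Option 2 raises TypeError:', e)

sales.index = range(len(sales))               # option 3: replace the entire index
print(sales)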

Changing index of a DataFrame

As you saw in the previous exercise, indexes are immutable objects. This means that if you want to change or modify the index in a DataFrame, then you need to change the whole index. You will do this now, using a list comprehension to create the new index.

A list comprehension is a succinct way to generate a list in one line. For example, the following list comprehension generates a list that contains the cubes of all numbers from 0 to 9:

cubes = [i**3 for i in range(10)]

This is equivalent to the following code:

cubes = []
for i in range(10):
    cubes.append(i**3)

Before getting started, print the sales DataFrame in the IPython Shell and verify that the index is given by month abbreviations containing lowercase characters.

Instructions

  • Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
  • Assign new_idx to sales.index.
  • Print the sales dataframe. This has been done for you, so hit ‘Submit Answer’ and to see how the index changed.
# Create the list of new indexes: new_idx
new_idx = [x.upper() for x in df.index]

# Assign new_idx to sales.index
df.index = new_idx

# Print the sales DataFrame
print(df)
     eggs  salt  spam
JAN    47  12.0    17
FEB   110  50.0    31
MAR   221  89.0    72
APR    77  87.0    20
MAY   132   NaN    52
JUN   205  60.0    55

Changing index name labels

Notice that in the previous exercise, the index was not labeled with a name. In this exercise, you will set its name to ‘MONTHS’.

Similarly, if all the columns are related in some way, you can provide a label for the set of columns.

To get started, print the sales DataFrame in the IPython Shell and verify that the index has no name, only its data (the month names).

Instructions

  • Assign the string ‘MONTHS’ to sales.index.name to create a name for the index.
  • Print the sales dataframe to see the index name you just created.
  • Now assign the string ‘PRODUCTS’ to sales.columns.name to give a name to the set of columns.
  • Print the sales dataframe again to see the columns name you just created.
# Assign the string 'MONTHS' to sales.index.name
df.index.name = 'MONTHS'

# Print the sales DataFrame
print(df)

# Assign the string 'PRODUCTS' to sales.columns.name 
df.columns.name = 'PRODUCTS'

# Print the sales dataframe again
print('\n', df)
        eggs  salt  spam
MONTHS                  
JAN       47  12.0    17
FEB      110  50.0    31
MAR      221  89.0    72
APR       77  87.0    20
MAY      132   NaN    52
JUN      205  60.0    55

 PRODUCTS  eggs  salt  spam
MONTHS                    
JAN         47  12.0    17
FEB        110  50.0    31
MAR        221  89.0    72
APR         77  87.0    20
MAY        132   NaN    52
JUN        205  60.0    55

Building an index, then a DataFrame

You can also build the DataFrame and index independently, and then put them together. If you take this route, be careful, as any mistakes in generating the DataFrame or the index can cause the data and the index to be aligned incorrectly.

In this exercise, the sales DataFrame has been provided for you without the month index. Your job is to build this index separately and then assign it to the sales DataFrame. Before getting started, print the sales DataFrame in the IPython Shell and note that it’s missing the month information.

Instructions

  • Generate a list months with the data [‘Jan’, ‘Feb’, ‘Mar’, ‘Apr’, ‘May’, ‘Jun’]. This has been done for you.
  • Assign months to sales.index.
  • Print the modified sales dataframe and verify that you now have month information in the index.
df = pd.read_csv(sales_data, usecols=['eggs', 'salt', 'spam'])
df
   eggs  salt  spam
0    47  12.0    17
1   110  50.0    31
2   221  89.0    72
3    77  87.0    20
4   132   NaN    52
5   205  60.0    55
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

# Assign months to sales.index
df.index = months

# Print the modified sales DataFrame
print(df)
     eggs  salt  spam
Jan    47  12.0    17
Feb   110  50.0    31
Mar   221  89.0    72
Apr    77  87.0    20
May   132   NaN    52
Jun   205  60.0    55

Hierarchical indexing

Stock Data

stocks = pd.DataFrame([['2016-10-03', 31.50, 14070500, 'CSCO'],
                       ['2016-10-03', 112.52, 21701800, 'AAPL'],
                       ['2016-10-03', 57.42, 19189500, 'MSFT'],
                       ['2016-10-04', 113.00, 29736800, 'AAPL'],
                       ['2016-10-04', 57.24, 20085900, 'MSFT'],
                       ['2016-10-04', 31.35, 18460400, 'CSCO'],
                       ['2016-10-05', 57.64, 16726400, 'MSFT'],
                       ['2016-10-05', 31.59, 11808600, 'CSCO'],
                       ['2016-10-05', 113.05, 21453100, 'AAPL']],
                      columns=['Date', 'Close', 'Volume', 'Symbol'])
stocks
         Date   Close    Volume Symbol
0  2016-10-03   31.50  14070500   CSCO
1  2016-10-03  112.52  21701800   AAPL
2  2016-10-03   57.42  19189500   MSFT
3  2016-10-04  113.00  29736800   AAPL
4  2016-10-04   57.24  20085900   MSFT
5  2016-10-04   31.35  18460400   CSCO
6  2016-10-05   57.64  16726400   MSFT
7  2016-10-05   31.59  11808600   CSCO
8  2016-10-05  113.05  21453100   AAPL

Setting index

stocks = stocks.set_index(['Symbol', 'Date'])
stocks
                    Close    Volume
Symbol Date                        
CSCO   2016-10-03   31.50  14070500
AAPL   2016-10-03  112.52  21701800
MSFT   2016-10-03   57.42  19189500
AAPL   2016-10-04  113.00  29736800
MSFT   2016-10-04   57.24  20085900
CSCO   2016-10-04   31.35  18460400
MSFT   2016-10-05   57.64  16726400
CSCO   2016-10-05   31.59  11808600
AAPL   2016-10-05  113.05  21453100

MultiIndex on DataFrame

stocks.index
MultiIndex([('CSCO', '2016-10-03'),
            ('AAPL', '2016-10-03'),
            ('MSFT', '2016-10-03'),
            ('AAPL', '2016-10-04'),
            ('MSFT', '2016-10-04'),
            ('CSCO', '2016-10-04'),
            ('MSFT', '2016-10-05'),
            ('CSCO', '2016-10-05'),
            ('AAPL', '2016-10-05')],
           names=['Symbol', 'Date'])
print(stocks.index.name)
print(stocks.index.names)
None
['Symbol', 'Date']

Sorting index

stocks = stocks.sort_index()
stocks
                    Close    Volume
Symbol Date                        
AAPL   2016-10-03  112.52  21701800
       2016-10-04  113.00  29736800
       2016-10-05  113.05  21453100
CSCO   2016-10-03   31.50  14070500
       2016-10-04   31.35  18460400
       2016-10-05   31.59  11808600
MSFT   2016-10-03   57.42  19189500
       2016-10-04   57.24  20085900
       2016-10-05   57.64  16726400

Indexing (individual row)

stocks.loc[('CSCO', '2016-10-04')]
Close           31.35
Volume    18460400.00
Name: (CSCO, 2016-10-04), dtype: float64
stocks.loc[('CSCO', '2016-10-04'), 'Volume']
18460400

Slicing (outermost index)

stocks.loc['AAPL']
             Close    Volume
Date                        
2016-10-03  112.52  21701800
2016-10-04  113.00  29736800
2016-10-05  113.05  21453100

Slicing (outermost index)

stocks.loc['CSCO':'MSFT']
                    Close    Volume
Symbol Date                        
CSCO   2016-10-03   31.50  14070500
       2016-10-04   31.35  18460400
       2016-10-05   31.59  11808600
MSFT   2016-10-03   57.42  19189500
       2016-10-04   57.24  20085900
       2016-10-05   57.64  16726400

Fancy indexing (outermost index)

stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), :]
                    Close    Volume
Symbol Date                        
AAPL   2016-10-05  113.05  21453100
MSFT   2016-10-05   57.64  16726400
stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), 'Close']
Symbol  Date      
AAPL    2016-10-05    113.05
MSFT    2016-10-05     57.64
Name: Close, dtype: float64

Fancy indexing (innermost index)

stocks.loc[('CSCO', ['2016-10-05', '2016-10-03']), :]
                   Close    Volume
Symbol Date                       
CSCO   2016-10-05  31.59  11808600
       2016-10-03  31.50  14070500

Slicing (both indexes)

stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')),:]
                    Close    Volume
Symbol Date                        
AAPL   2016-10-03  112.52  21701800
       2016-10-04  113.00  29736800
CSCO   2016-10-03   31.50  14070500
       2016-10-04   31.35  18460400
MSFT   2016-10-03   57.42  19189500
       2016-10-04   57.24  20085900

Exercises

Sales Data

sales = pd.read_csv(sales2_data, index_col=['state', 'month'])
sales
             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55

Extracting data with a MultiIndex

In the video, Dhavide explained the concept of a hierarchical index, or a MultiIndex. You will now practice working with these types of indexes.

The sales DataFrame you have been working with has been extended to now include State information as well. In the IPython Shell, print the new sales DataFrame to inspect the data. Take note of the MultiIndex!

Extracting elements from the outermost level of a MultiIndex is just like in the case of a single-level Index. You can use the .loc[] accessor as Dhavide demonstrated in the video.

Instructions

  • Print sales.loc[[‘CA’, ‘TX’]]. Note how New York is excluded.
  • Print sales[‘CA’:’TX’]. Note how New York is included.
# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])

# Print sales['CA':'TX']
print('\n', sales['CA':'TX'])
             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
TX    1       132   NaN    52
      2       205  60.0    55

              eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55

Notice how New York is excluded by the first operation, and included in the second one.

Setting & sorting a MultiIndex

In the previous exercise, the MultiIndex was created and sorted for you. Now, you’re going to do this yourself! With a MultiIndex, you should always ensure the index is sorted. You can skip this only if you know the data is already sorted on the index fields.
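
As a minimal illustration of why sorting matters (a hypothetical two-level frame, not the course data), label-slicing an unsorted MultiIndex raises an UnsortedIndexError, while the sorted version slices cleanly:

# Hypothetical example: slicing fails on an unsorted MultiIndex
small = pd.DataFrame({'val': [1, 2, 3, 4]},
                     index=pd.MultiIndex.from_tuples(
                         [('TX', 2), ('CA', 1), ('TX', 1), ('CA', 2)],
                         names=['state', 'month']))

try:
    small.loc['CA':'TX']                  # index is not lexsorted
except pd.errors.UnsortedIndexError as e:
    print('Unsorted index:', e)

print(small.sort_index().loc['CA':'TX'])  # works after sorting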

To get started, print the pre-loaded sales DataFrame in the IPython Shell to verify that there is no MultiIndex.

Instructions

  • Create a MultiIndex by setting the index to be the columns [‘state’, ‘month’].
  • Sort the MultiIndex using the .sort_index() method.
  • Print the sales DataFrame. This has been done for you, so hit ‘Submit Answer’ to verify that indeed you have an index with the fields state and month!
sales = pd.read_csv(sales2_data)

# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])

# Sort the MultiIndex: sales
sales = sales.sort_index()

# Print the sales DataFrame
sales
             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55

Using .loc[] with nonunique indexes

As Dhavide mentioned in the video, it is always preferable to have a meaningful index that uniquely identifies each row. Even though pandas does not require unique index values in DataFrames, it works better if the index values are indeed unique. To see an example of this, you will index your sales data by ‘state’ in this exercise.

As always, begin by printing the sales DataFrame in the IPython Shell and inspecting it.

Instructions

  • Set the index of sales to be the column ‘state’.
  • Print the sales DataFrame to verify that indeed you have an index with state values.
  • Access the data from ‘NY’ and print it to verify that you obtain two rows.
sales = pd.read_csv(sales2_data)

# Set the index to the column 'state': sales
sales = sales.set_index('state')

# Print the sales DataFrame
print(sales)

# Access the data from 'NY'
print('\n', sales.loc['NY'])
       month  eggs  salt  spam
state                         
CA         1    47  12.0    17
CA         2   110  50.0    31
NY         1   221  89.0    72
NY         2    77  87.0    20
TX         1   132   NaN    52
TX         2   205  60.0    55

        month  eggs  salt  spam
state                         
NY         1   221  89.0    72
NY         2    77  87.0    20

Indexing multiple levels of a MultiIndex

Looking up indexed data is fast and efficient. And you have already seen that lookups based on the outermost level of a MultiIndex work just like lookups on DataFrames that have a single-level Index.

Looking up data based on inner levels of a MultiIndex can be a bit trickier. In this exercise, you will use your sales DataFrame to do some increasingly complex lookups.

The trickiest of all these lookups are when you want to access some inner levels of the index. In this case, you need to use slice(None) in the slicing parameter for the outermost dimension(s) instead of the usual :, or use pd.IndexSlice. You can refer to the pandas documentation for more details. For example, in the video, Dhavide used the following code to extract rows from all Symbols for the dates Oct. 3rd through 4th inclusive:

stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]

Pay particular attention to the tuple:

(slice(None), slice('2016-10-03', '2016-10-04')).
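
As an alternative to writing nested slice() objects, pd.IndexSlice (mentioned above) provides a more readable spelling; a minimal sketch of the equivalent lookup on the stocks DataFrame would be:

# Equivalent lookup using pd.IndexSlice (assumes the sorted stocks DataFrame from earlier)
idx = pd.IndexSlice
stocks.loc[idx[:, '2016-10-03':'2016-10-04'], :]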

Instructions

  • Look up data for the New York column (‘NY’) in month 1.
  • Look up data for the California and Texas columns (‘CA’, ‘TX’) in month 2.
  • Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2
sales = pd.read_csv(sales2_data, index_col=['state', 'month'])
sales
             eggs  salt  spam
state month                  
CA    1        47  12.0    17
      2       110  50.0    31
NY    1       221  89.0    72
      2        77  87.0    20
TX    1       132   NaN    52
      2       205  60.0    55
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
NY_month1
eggs    221.0
salt     89.0
spam     72.0
Name: (NY, 1), dtype: float64
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(('CA', 'TX'), 2), :]
CA_TX_month2
             eggs  salt  spam
state month                  
CA    2       110  50.0    31
TX    2       205  60.0    55
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
all_month2
             eggs  salt  spam
state month                  
CA    2       110  50.0    31
NY    2        77  87.0    20
TX    2       205  60.0    55

Rearranging and reshaping data

Here, you will learn how to reshape your DataFrames using techniques such as pivoting, melting, stacking, and unstacking. These are powerful techniques that allow you to tidy and rearrange your data into the format that allows you to most easily analyze it for insights.

Pivoting DataFrames

Clinical Trials Data

trials = pd.DataFrame([[1, 'A', 'F', 5],
                       [2, 'A', 'M', 3],
                       [3, 'B', 'F', 8],
                       [4, 'B', 'M', 9]],
                      columns=['id', 'treatment', 'gender', 'response'])
trials
   id treatment gender  response
0   1         A      F         5
1   2         A      M         3
2   3         B      F         8
3   4         B      M         9

Reshaping by pivoting

trials.pivot(index='treatment',
             columns='gender',
             values='response')
gender     F  M
treatment      
A          5  3
B          8  9

Pivoting multiple columns

trials.pivot(index='treatment', columns='gender')
          id    response   
gender     F  M        F  M
treatment                  
A          1  2        5  3
B          3  4        8  9

Exercises

Pivoting and the index

Prior to using .pivot(), you need to set the index of the DataFrame somehow. Is this statement True or False?

Answer the question

False. The .pivot() method builds its row labels from whichever column you pass to index=, so you do not need to set the index of the DataFrame beforehand.

Pivoting a single variable

Suppose you started a blog for a band, and you would like to log how many visitors you have had, and how many signed-up for your newsletter. To help design the tours later, you track where the visitors are. A DataFrame called users consisting of this information has been pre-loaded for you.

Inspect users in the IPython Shell and make a note of which variable you want to use to index the rows (‘weekday’), which variable you want to use to index the columns (‘city’), and which variable will populate the values in the cells (‘visitors’). Try to visualize what the result should be.

For example, in the video, Dhavide used ‘treatment’ to index the rows, ‘gender’ to index the columns, and ‘response’ to populate the cells. Prior to pivoting, the DataFrame looked like this:

   id treatment gender  response
0   1         A      F         5
1   2         A      M         3
2   3         B      F         8
3   4         B      M         9

After pivoting:

gender     F  M
treatment      
A          5  3
B          8  9

In this exercise, your job is to pivot users so that the focus is on ‘visitors’, with the columns indexed by ‘city’ and the rows indexed by ‘weekday’.

Instructions

  • Pivot the users DataFrame with the rows indexed by ‘weekday’, the columns indexed by ‘city’, and the values populated with ‘visitors’.
  • Print the pivoted DataFrame. This has been done for you, so hit ‘Submit Answer’ to view the result.
users = pd.read_csv(users_data)
users
  weekday    city  visitors  signups
0     Sun  Austin       139        7
1     Sun  Dallas       237       12
2     Mon  Austin       326        3
3     Mon  Dallas       456        5
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday',
                             columns='city',
                             values='visitors')
visitors_pivot
city     Austin  Dallas
weekday                
Mon         326     456
Sun         139     237

Notice how in the pivoted DataFrame, the index is labeled ‘weekday’, the columns are labeled ‘city’, and the values are populated by the number of visitors.

Pivoting all variables

If you do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both ‘visitors’ and ‘signups’ will be pivoted, creating hierarchical column labels.

You will explore this for yourself now in this exercise.

Instructions

  • Pivot the users DataFrame with the ‘signups’ indexed by ‘weekday’ in the rows and ‘city’ in the columns.
  • Print the new DataFrame. This has been done for you.
  • Pivot the users DataFrame with both ‘signups’ and ‘visitors’ pivoted - that is, all the variables. This will happen automatically if you do not specify an argument for the values parameter of .pivot().
  • Print the pivoted DataFrame. This has been done for you, so hit ‘Submit Answer’ to see the result.
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')
signups_pivot
city     Austin  Dallas
weekday                
Mon           3       5
Sun           7      12
# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday', columns='city')
pivot
        visitors         signups       
city      Austin Dallas   Austin Dallas
weekday                                
Mon          326    456        3      5
Sun          139    237        7     12

Notice how in the second DataFrame, both ‘signups’ and ‘visitors’ were pivoted by default since you didn’t provide an argument for the values parameter.

Stacking & unstacking DataFrames

Creating a multi-level index

trials
   id treatment gender  response
0   1         A      F         5
1   2         A      M         3
2   3         B      F         8
3   4         B      M         9
trials = trials.set_index(['treatment', 'gender'])
trials
                  id  response
treatment gender              
A         F        1         5
          M        2         3
B         F        3         8
          M        4         9

Unstacking a multi-index (1)

trials.unstack(level='gender')
          id    response   
gender     F  M        F  M
treatment                  
A          1  2        5  3
B          3  4        8  9

Unstacking a multi-index (2)

trials
                  id  response
treatment gender              
A         F        1         5
          M        2         3
B         F        3         8
          M        4         9

trials.unstack(level=1)
          id    response   
gender     F  M        F  M
treatment                  
A          1  2        5  3
B          3  4        8  9

Stacking DataFrames

trials_by_gender = trials.unstack(level='gender')
trials_by_gender
          id    response   
gender     F  M        F  M
treatment                  
A          1  2        5  3
B          3  4        8  9

trials_by_gender.stack(level='gender')
                  id  response
treatment gender              
A         F        1         5
          M        2         3
B         F        3         8
          M        4         9

stacked = trials_by_gender.stack(level='gender')
stacked
                  id  response
treatment gender              
A         F        1         5
          M        2         3
B         F        3         8
          M        4         9

Swapping levels

swapped = stacked.swaplevel(0, 1)
swapped
                  id  response
gender treatment              
F      A           1         5
M      A           2         3
F      B           3         8
M      B           4         9

Sorting rows

1
2
sorted_trials = swapped.sort_index()
sorted_trials
idresponse
gendertreatment
FA15
B38
MA23
B49

Exercises

Stacking & unstacking I

You are now going to practice stacking and unstacking DataFrames. The users DataFrame you have been working with in this chapter has been pre-loaded for you, this time with a MultiIndex. Explore it in the IPython Shell to see the data layout. Pay attention to the index, and notice that the index levels are [‘city’, ‘weekday’]. So ‘weekday’ - the second entry - has position 1. This position is what corresponds to the level parameter in .stack() and .unstack() calls. Alternatively, you can specify ‘weekday’ as the level instead of its position.

Your job in this exercise is to unstack users by ‘weekday’. You will then use .stack() on the unstacked DataFrame to see if you get back the original layout of users.

Instructions

  • Define a DataFrame byweekday with the ‘weekday’ level of users unstacked.
  • Print the byweekday DataFrame to see the new data layout. This has been done for you.
  • Stack byweekday by ‘weekday’ and print it to check if you get the same layout as the original users DataFrame.
1
2
3
4
users = pd.read_csv(users_data)
users.set_index(['city', 'weekday'], inplace=True)
users = users.sort_index()
users
visitorssignups
cityweekday
AustinMon3263
Sun1397
DallasMon4565
Sun23712
1
2
3
# Unstack users by 'weekday': byweekday
byweekday = users.unstack(level='weekday')
byweekday
visitorssignups
weekdayMonSunMonSun
city
Austin32613937
Dallas456237512
1
2
# Stack byweekday by 'weekday' and print it
byweekday.stack(level='weekday')
visitorssignups
cityweekday
AustinMon3263
Sun1397
DallasMon4565
Sun23712
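As a quick check of the earlier point about level positions (a small sketch using the MultiIndexed users from above), unstacking by position 1 and by the level name give the same result:

# Sketch: level 1 of the index is 'weekday', so both calls are equivalent
users.unstack(level=1).equals(users.unstack(level='weekday'))  # expected: True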

Stacking & unstacking II

You are now going to continue working with the users DataFrame. As always, first explore it in the IPython Shell to see the layout and note the index.

Your job in this exercise is to unstack and then stack the ‘city’ level, as you did previously for ‘weekday’. Note that you won’t get the same DataFrame.

Instructions

  • Define a DataFrame bycity with the ‘city’ level of users unstacked.
  • Print the bycity DataFrame to see the new data layout. This has been done for you.
  • Stack bycity by ‘city’ and print it to check if you get the same layout as the original users DataFrame.
1
users
visitorssignups
cityweekday
AustinMon3263
Sun1397
DallasMon4565
Sun23712
1
2
3
# Unstack users by 'city': bycity
bycity = users.unstack(level='city')
bycity
visitorssignups
cityAustinDallasAustinDallas
weekday
Mon32645635
Sun139237712
1
2
# Stack bycity by 'city' and print it
bycity.stack(level='city')
visitorssignups
weekdaycity
MonAustin3263
Dallas4565
SunAustin1397
Dallas23712

Restoring the index order

Continuing from the previous exercise, you will now use .swaplevel(0, 1) to flip the index levels. Note they won’t be sorted. To sort them, you will have to follow up with a .sort_index(). You will then obtain the original DataFrame. Note that an unsorted index leads to slicing failures.

To begin, print both users and bycity in the IPython Shell. The goal here is to convert bycity back to something that looks like users.
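The warning about slicing failures is easy to reproduce. Here is a minimal sketch (a small hypothetical MultiIndexed frame, not the exercise data): slicing a range on an unsorted MultiIndex raises pandas.errors.UnsortedIndexError, while the same slice works once the index is sorted.

# Sketch: range slicing fails on an unsorted MultiIndex
idx = pd.MultiIndex.from_tuples(
    [('Dallas', 'Mon'), ('Austin', 'Mon'), ('Dallas', 'Sun'), ('Austin', 'Sun')],
    names=['city', 'weekday'])
demo = pd.DataFrame({'visitors': [456, 326, 237, 139]}, index=idx)

try:
    demo.loc[('Austin', 'Mon'):('Dallas', 'Mon')]  # unsorted: raises
except pd.errors.UnsortedIndexError as err:
    print('UnsortedIndexError:', err)

demo.sort_index().loc[('Austin', 'Mon'):('Dallas', 'Mon')]  # works after sorting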

Instructions

  • Define a DataFrame newusers with the ‘city’ level stacked back into the index of bycity.
  • Swap the levels of the index of newusers.
  • Print newusers and verify that the index is not sorted. This has been done for you.
  • Sort the index of newusers.
  • Print newusers and verify that the index is now sorted. This has been done for you.
  • Assert that newusers equals users. This has been done for you, so hit ‘Submit Answer’ to see the result.
1
bycity
visitorssignups
cityAustinDallasAustinDallas
weekday
Mon32645635
Sun139237712
1
2
3
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level='city')
newusers
visitorssignups
weekdaycity
MonAustin3263
Dallas4565
SunAustin1397
Dallas23712
1
2
3
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)
newusers
visitorssignups
cityweekday
AustinMon3263
DallasMon4565
AustinSun1397
DallasSun23712
1
2
3
# Sort the index of newusers: newusers
newusers = newusers.sort_index()
newusers
visitorssignups
cityweekday
AustinMon3263
Sun1397
DallasMon4565
Sun23712
1
2
# Verify that the new DataFrame is equal to the original
newusers.equals(users)
1
True

Melting DataFrames

Clinical Trials Data

1
2
3
4
5
6
trials = pd.DataFrame([[1, 'A', 'F', 5],
                       [2, 'A', 'M', 3],
                       [3, 'B', 'F', 8],
                       [4, 'B', 'M', 9]],
                      columns=['id', 'treatment', 'gender', 'response'])
trials
idtreatmentgenderresponse
01AF5
12AM3
23BF8
34BM9

Clinical trials after pivoting

1
2
3
4
trials.pivot(index='treatment',
             columns='gender',
             values='response')

genderFM
treatment
A53
B89

Clinical trials data - new

1
2
3
4
new_trials = pd.DataFrame([['A', 5, 3],
                           ['B', 8, 9],],
                          columns=['treatment', 'F', 'M'])
new_trials
treatmentFM
0A53
1B89

Melting DataFrame

1
pd.melt(new_trials)
variablevalue
0treatmentA
1treatmentB
2F5
3F8
4M3
5M9

Specifying id_vars

1
pd.melt(new_trials, id_vars=['treatment'])
treatmentvariablevalue
0AF5
1BF8
2AM3
3BM9

Specifying value_vars

1
2
pd.melt(new_trials, id_vars=['treatment'],
        value_vars=['F', 'M'])
treatmentvariablevalue
0AF5
1BF8
2AM3
3BM9

Specifying var_name and value_name

1
2
pd.melt(new_trials, id_vars=['treatment'],
        var_name='gender', value_name='response')
treatmentgenderresponse
0AF5
1BF8
2AM3
3BM9

Exercises

Adding names for readability

You are now going to practice melting DataFrames. A DataFrame called visitors_by_city_weekday has been pre-loaded for you. Explore it in the IPython Shell and see that it is the users DataFrame from previous exercises with the rows indexed by ‘weekday’, columns indexed by ‘city’, and values populated with ‘visitors’.

Recall from the video that the goal of melting is to restore a pivoted DataFrame to its original form, or to change it from a wide shape to a long shape. You can explicitly specify the columns that should remain in the reshaped DataFrame with id_vars, and list which columns to convert into values with value_vars. As Dhavide demonstrated, if you don’t pass a name for the values in pd.melt(), you will lose the name of your variable. You can fix this by using the value_name keyword argument.

Your job in this exercise is to melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called ‘city’. If you were to use just pd.melt(visitors_by_city_weekday), you would obtain the following result:

1
2
3
4
5
6
7
      city value
0  weekday   Mon
1  weekday   Sun
2   Austin   326
3   Austin   139
4   Dallas   456
5   Dallas   237

Therefore, you have to specify the id_vars keyword argument to ensure that ‘weekday’ is retained in the reshaped DataFrame, and the value_name keyword argument to change the name of value to visitors.

Instructions

  • Reset the index of visitors_by_city_weekday with .reset_index().
  • Print visitors_by_city_weekday and verify that you have just a range index, 0, 1, 2, 3. This has been done for you.
  • Melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called visitors.
  • Print visitors to check that the city values are in a single column now and that the DataFrame is longer and skinnier.
1
2
3
4
5
# Create the DataFrame for the exercise
visitors = pd.DataFrame({'weekday': ['Mon', 'Sun', 'Mon', 'Sun'],
                         'city': ['Austin', 'Austin', 'Dallas', 'Dallas'],
                         'visitors': [326, 139, 456, 237]})
visitors
weekdaycityvisitors
0MonAustin326
1SunAustin139
2MonDallas456
3SunDallas237
1
2
3
4
5
# Reshape the DataFrame for the exercise
visitors_by_city_weekday = visitors.pivot(index='weekday',
                                          columns='city',
                                          values='visitors')
visitors_by_city_weekday
cityAustinDallas
weekday
Mon326456
Sun139237
1
2
3
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
visitors_by_city_weekday
cityweekdayAustinDallas
0Mon326456
1Sun139237
1
2
3
# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')
visitors
weekdaycityvisitors
0MonAustin326
1SunAustin139
2MonDallas456
3SunDallas237

Notice how the melted DataFrame now has a ‘city’ column with Austin and Dallas as its values. In the original DataFrame, they were columns themselves. Also note how specifying the value_name parameter has renamed the ‘value’ column to ‘visitors’.

Going from wide to long

You can move multiple columns into a single column (making the data long and skinny) by “melting” multiple columns. In this exercise, you will practice doing this.

The users DataFrame has been pre-loaded for you. As always, explore it in the IPython Shell and note the index.

Instructions

  • Define a DataFrame skinny where you melt the ‘visitors’ and ‘signups’ columns of users into a single column.
  • Print skinny to verify the results. Note the value column that had the cell values in users.
1
2
users = pd.read_csv(users_data)
users
weekdaycityvisitorssignups
0SunAustin1397
1SunDallas23712
2MonAustin3263
3MonDallas4565
1
2
3
# Melt users: skinny
skinny = users.melt(id_vars=['weekday', 'city'])
skinny
weekdaycityvariablevalue
0SunAustinvisitors139
1SunDallasvisitors237
2MonAustinvisitors326
3MonDallasvisitors456
4SunAustinsignups7
5SunDallassignups12
6MonAustinsignups3
7MonDallassignups5

Because var_name or value_name parameters weren’t specified, the melted DataFrame has the default variable and value column names.
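A hypothetical variant (sketch only) that names the melted columns explicitly instead of accepting the defaults:

# Sketch: var_name/value_name replace the default 'variable'/'value' labels
users.melt(id_vars=['weekday', 'city'], var_name='metric', value_name='count')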

Obtaining key-value pairs with melt()

Sometimes all you need are key-value pairs, and the context does not matter. If that context lives in the index, you can easily obtain what you want. For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs. So if you create a hierarchical index with the ‘city’ and ‘weekday’ columns, you can easily extract key-value pairs for the ‘visitors’ and ‘signups’ columns by melting users and specifying col_level=0.

Instructions

  • Set the index of users to [‘city’, ‘weekday’].
  • Print the DataFrame users_idx to see the new index.
  • Obtain the key-value pairs corresponding to visitors and signups by melting users_idx with the keyword argument col_level=0.
1
users
weekdaycityvisitorssignups
0SunAustin1397
1SunDallas23712
2MonAustin3263
3MonDallas4565
1
2
3
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])
users_idx
visitorssignups
cityweekday
AustinSun1397
DallasSun23712
AustinMon3263
DallasMon4565
1
2
3
# Obtain the key-value pairs: kv_pairs
kv_pairs = users_idx.melt(col_level=0)
kv_pairs
variablevalue
0visitors139
1visitors237
2visitors326
3visitors456
4signups7
5signups12
6signups3
7signups5

Pivot tables

More clinical trials data

1
2
3
4
5
6
7
8
9
10
more_trials = pd.DataFrame([[1, 'A', 'F', 5],
                            [2, 'A', 'M', 3],
                            [3, 'A', 'M', 8],
                            [4, 'A', 'F', 9],
                            [5, 'B', 'F', 1],
                            [6, 'B', 'M', 8],
                            [7, 'B', 'F', 4],
                            [8, 'B', 'F', 6]],
                           columns=['id', 'treatment', 'gender', 'response'])
more_trials
idtreatmentgenderresponse
01AF5
12AM3
23AM8
34AF9
45BF1
56BM8
67BF4
78BF6

Rearranging by pivoting

  • .pivot requires unique index/column pairs to identify values in the new table.
1
2
3
4
5
6
try:
    more_trials.pivot(index='treatment',
                      columns='gender',
                      values='response')
except ValueError:
    print('ValueError: Index contains duplicate entries, cannot reshape')
1
ValueError: Index contains duplicate entries, cannot reshape

Pivot table

  • The pivot_table method reshapes the DataFrame by summarizing it with a pair of grouping variables and their values.
  • Pivot tables deal with multiple values for the same index/column pair using a reduction
  • By default, the reduction is an average
1
2
3
more_trials.pivot_table(index='treatment',
                        columns='gender',
                        values='response')
genderFM
treatment
A7.0000005.5
B3.6666678.0

Other aggregations

1
2
3
4
more_trials.pivot_table(index='treatment',
                        columns='gender',
                        values='response',
                        aggfunc='count')
genderFM
treatment
A22
B31

Exercises

Setting up a pivot table

Recall from the video that a pivot table allows you to see all of your variables as a function of two other variables. In this exercise, you will use the .pivot_table() method to see how the users DataFrame entries appear when presented as functions of the ‘weekday’ and ‘city’ columns. That is, with the rows indexed by ‘weekday’ and the columns indexed by ‘city’.

Before using the pivot table, print the users DataFrame in the IPython Shell and observe the layout.

Instructions

  • Use a pivot table to index the rows of users by ‘weekday’ and the columns of users by ‘city’. These correspond to the index and columns parameters of .pivot_table().
  • Print by_city_day. This has been done for you, so hit ‘Submit Answer’ to see the result.
1
users = pd.read_csv(users_data)
1
2
3
4
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index='weekday',
                                columns='city')
by_city_day
signupsvisitors
cityAustinDallasAustinDallas
weekday
Mon3.05.0326.0456.0
Sun7.012.0139.0237.0

Notice the labels of the index and the columns are ‘weekday’ and ‘city’, respectively - exactly as you specified.

Using other aggregations in pivot tables

You can also use aggregation functions within a pivot table by specifying the aggfunc parameter. In this exercise, you will practice using the ‘count’ and len aggregation functions - which produce the same result - on the users DataFrame.

Instructions

  • Define a DataFrame count_by_weekday1 that shows the count of each column with the parameter aggfunc=’count’. The index here is ‘weekday’.
  • Print count_by_weekday1. This has been done for you.
  • Replace aggfunc=’count’ with aggfunc=len and verify you obtain the same result.
1
2
3
4
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday',
                                      aggfunc='count')
count_by_weekday1
citysignupsvisitors
weekday
Mon222
Sun222
1
2
3
4
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday',
                                      aggfunc=len)
count_by_weekday2
citysignupsvisitors
weekday
Mon222
Sun222
1
2
# Verify that the same result is obtained
count_by_weekday1.equals(count_by_weekday2)
1
True

Using margins in pivot tables

Sometimes it’s useful to add totals in the margins of a pivot table. You can do this with the argument margins=True. In this exercise, you will practice using margins in a pivot table along with a new aggregation function: sum.

The users DataFrame, which you are now probably very familiar with, has been pre-loaded for you.

Instructions

  • Define a DataFrame signups_and_visitors that shows the breakdown of signups and visitors by day.
    • You will need to use aggfunc=sum to do this.
  • Print signups_and_visitors. This has been done for you.
  • Now pass the additional argument margins=True to the .pivot_table() method to obtain the totals.
  • Print signups_and_visitors_total. This has been done for you, so hit ‘Submit Answer’ to see the result.
1
2
3
4
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday',
                                         aggfunc='sum')
signups_and_visitors
citysignupsvisitors
weekday
MonAustinDallas8782
SunAustinDallas19376
1
2
3
4
5
6
# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = users.pivot_table(index='weekday',
                                               values=['signups', 'visitors'],
                                               aggfunc='sum',
                                               margins=True)
signups_and_visitors_total
signupsvisitors
weekday
Mon8782
Sun19376
All271158

Specifying margins=True resulted in the totals in each column being computed.
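Margins also combine with a columns argument. A small sketch (same users data) that adds both a totals row and a totals column:

# Sketch: with columns specified, margins=True adds an 'All' row and an 'All' column
users.pivot_table(index='weekday', columns='city',
                  values='visitors', aggfunc='sum', margins=True)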

Grouping data

In this chapter, you’ll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You’ll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.

Categorical and groupby

Sales Data

1
2
3
4
5
6
7
8
9
sales = pd.DataFrame(
    {
        'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
        'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
        'bread': [139, 237, 326, 456],
        'butter': [20, 45, 70, 98]
    }
)
sales
weekdaycitybreadbutter
0SunAustin13920
1SunDallas23745
2MonAustin32670
3MonDallas45698

Boolean filter and count

1
sales.loc[sales['weekday'] == 'Sun'].count()
1
2
3
4
5
weekday    2
city       2
bread      2
butter     2
dtype: int64

Groupby and count

1
sales.groupby('weekday').count()
citybreadbutter
weekday
Mon222
Sun222

Split-apply-combine

  • sales.groupby(‘weekday’).count()
    • split by ‘weekday’
    • apply count() function on each group
    • combine counts per group (a manual equivalent is sketched below)
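A rough manual equivalent of those three steps (sketch only, for illustration):

# Sketch: split-apply-combine written out by hand
pieces = []
for day, group in sales.groupby('weekday'):          # split by 'weekday'
    counts = group.drop(columns='weekday').count()   # apply count() to each group
    counts.name = day
    pieces.append(counts)
pd.DataFrame(pieces).rename_axis('weekday')           # combine counts per group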

Aggregation/Reduction

  • Some reducing functions
    • mean()
    • std()
    • sum()
    • first(), last()
    • min(), max()

Groupby and sum

1
sales.groupby('weekday')['bread'].sum()
1
2
3
4
weekday
Mon    782
Sun    376
Name: bread, dtype: int64

Groupby and sum: multiple columns

1
sales
weekdaycitybreadbutter
0SunAustin13920
1SunDallas23745
2MonAustin32670
3MonDallas45698
1
sales.groupby('weekday')[['bread','butter']].sum()
breadbutter
weekday
Mon782168
Sun37665

Groupby and mean: multi-level index

1
sales.groupby(['city','weekday']).mean()
breadbutter
cityweekday
AustinMon326.070.0
Sun139.020.0
DallasMon456.098.0
Sun237.045.0

Customers

1
2
customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers
1
2
3
4
5
0     Dave
1    Alice
2      Bob
3    Alice
dtype: object

Groupby and sum: by series

1
sales.groupby(customers)['bread'].sum()
1
2
3
4
Alice    693
Bob      326
Dave     139
Name: bread, dtype: int64

Categorical data

1
sales['weekday'].unique()
1
array(['Sun', 'Mon'], dtype=object)
1
2
sales['weekday'] = sales['weekday'].astype('category')
sales['weekday']
1
2
3
4
5
6
0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: category
Categories (2, object): ['Mon', 'Sun']

Categorical data

  • Advantages
    • Use less memory (see the sketch below)
    • Speeds up operations like groupby()
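A quick sketch (synthetic data, assumed for illustration) of the memory saving:

# Sketch: repeated strings stored as 'category' use far less memory than 'object'
weekdays_obj = pd.Series(['Sun', 'Mon'] * 50_000)   # object dtype
weekdays_cat = weekdays_obj.astype('category')

print('object:  ', weekdays_obj.memory_usage(deep=True), 'bytes')
print('category:', weekdays_cat.memory_usage(deep=True), 'bytes')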

Exercises

Advantages of categorical data types

What are the main advantages of storing data explicitly as categorical types instead of object types?

Answer the question

  • Computations are faster.
  • Categorical data require less space in memory.
  • All of the above.

Grouping by multiple columns

In this exercise, you will return to working with the Titanic dataset from Chapter 1 and use .groupby() to analyze the distribution of passengers who boarded the Titanic.

The ‘pclass’ column identifies which class of ticket was purchased by the passenger and the ‘embarked’ column indicates at which of the three ports the passenger boarded the Titanic. ‘S’ stands for Southampton, England, ‘C’ for Cherbourg, France and ‘Q’ for Queenstown, Ireland.

Your job is to first group by the ‘pclass’ column and count the number of rows in each class using the ‘survived’ column. You will then group by the ‘embarked’ and ‘pclass’ columns and count the number of passengers.

The DataFrame has been pre-loaded as titanic.

Instructions

  • Group by the ‘pclass’ column and save the result as by_class.
  • Aggregate the ‘survived’ column of by_class using .count(). Save the result as count_by_class.
  • Print count_by_class. This has been done for you.
  • Group titanic by the ‘embarked’ and ‘pclass’ columns. Save the result as by_mult.
  • Aggregate the ‘survived’ column of by_mult using .count(). Save the result as count_mult.
  • Print count_mult. This has been done for you, so hit ‘Submit Answer’ to view the result.

Titanic Data

1
2
titanic = pd.read_csv(titanic_data)
titanic.head(3)
pclasssurvivednamesexagesibspparchticketfarecabinembarkedboatbodyhome.dest
011Allen, Miss. Elisabeth Waltonfemale29.000024160211.3375B5S2NaNSt Louis, MO
111Allison, Master. Hudson Trevormale0.9212113781151.5500C22 C26S11NaNMontreal, PQ / Chesterville, ON
210Allison, Miss. Helen Lorainefemale2.0012113781151.5500C22 C26SNaNNaNMontreal, PQ / Chesterville, ON
1
2
3
4
5
6
# Group titanic by 'pclass'
by_class = titanic.groupby('pclass')

# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()
count_by_class
1
2
3
4
5
pclass
1    323
2    277
3    709
Name: survived, dtype: int64
1
2
3
4
5
6
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked', 'pclass'])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()
count_mult
1
2
3
4
5
6
7
8
9
10
11
embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         495
Name: survived, dtype: int64

Grouping data by certain columns and aggregating them by another column, in this case, ‘survived’, allows for careful examination of the data for interesting insights.
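As an optional follow-up (sketch only), the MultiIndexed result can be unstacked into an embarked-by-pclass table for easier comparison:

# Sketch: reshape the grouped counts into a small cross-table
count_mult.unstack('pclass')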

Grouping by another series

In this exercise, you’ll use two data sets from Gapminder.org to investigate the average life expectancy (in years) at birth in 2010 for the 6 continental regions. To do this you’ll read the life expectancy data per country into one pandas DataFrame and the association between country and region into another.

By setting the index of both DataFrames to the country name, you’ll then use the region information to group the countries in the life expectancy DataFrame and compute the mean value for 2010.

The life expectancy CSV file is available to you in the variable life_fname and the regions filename is available in the variable regions_fname.

Instructions

  • Read life_fname into a DataFrame called life and set the index to ‘Country’.
  • Read regions_fname into a DataFrame called regions and set the index to ‘Country’.
  • Group life by the region column of regions and store the result in life_by_region.
  • Print the mean over the 2010 column of life_by_region.
1
2
3
4
5
6
# Read life_fname into a DataFrame: life
life = pd.read_csv(gapminder_data, usecols=['Year', 'life', 'Country'])
life = life.pivot(index='Country',
                  columns='Year',
                  values='life')
life.head()
Year19641965196619671968196919701971197219731974197519761977197819791980198119821983198419851986198719881989199019911992199319941995199619971998199920002001200220032004200520062007200820092010201120122013
Country
Afghanistan33.63934.15234.66235.17035.67436.17236.66337.14337.61438.07538.52938.97739.41739.85540.29840.75641.24241.77042.34742.97743.66144.40045.19246.02446.88047.74448.60149.43950.24751.01751.73852.40052.99553.52754.00954.44954.86355.27155.68756.12256.58357.07157.58258.10258.61859.12459.61260.07960.52460.947
Albania65.47565.86366.12266.31666.50066.70266.94867.25167.59567.96668.35668.74869.12169.45969.75370.00170.21870.42670.64670.88671.14471.39871.61571.77071.85371.87071.84271.79971.77971.81371.92072.11772.41572.79673.23573.71374.20074.66475.08175.43775.72575.94976.12476.27876.43376.59876.78076.97977.18577.392
Algeria47.95348.38948.80649.20549.59249.97650.36650.76751.19551.67052.21352.86153.65654.60555.69756.90758.19859.52460.82662.05163.16064.12064.91165.55466.07266.47966.79667.04967.26567.46867.67467.89368.12368.35068.56568.76968.96369.14969.33069.50869.68269.85470.02070.18070.33270.47770.61570.74770.87471.000
Angola34.60435.00735.41035.81636.22236.62737.03237.43937.84638.24738.63538.99839.32439.60539.84040.02940.18240.31140.42940.54740.67140.79440.90240.98841.05041.10041.15141.22141.32941.49541.73642.07342.52643.08843.74244.46845.23446.00446.74347.42548.03648.57249.04149.47149.88250.28650.68951.09451.49851.899
Antigua and Barbuda63.77564.14964.51164.86565.21365.55865.89866.23266.55866.87567.18167.47967.76868.05168.32868.60268.87369.14169.40869.67169.93170.18670.43570.67570.90771.13271.35171.56871.78372.00072.21972.44172.66472.88873.11073.32973.54473.75573.96074.16074.35574.54474.72974.91075.08775.26375.43775.61075.78375.954
1
2
3
4
5
regions = pd.read_csv(gapminder_data, usecols=['region', 'Country'])
regions.drop_duplicates(subset='Country', inplace=True)
regions.reset_index(inplace=True, drop=True)
regions.set_index('Country', inplace=True)
regions.head()
region
Country
AfghanistanSouth Asia
AlbaniaEurope & Central Asia
AlgeriaMiddle East & North Africa
AngolaSub-Saharan Africa
Antigua and BarbudaAmerica
1
2
3
4
5
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])

# Print the mean over the '2010' column of life_by_region
life_by_region[2010].mean()
1
2
3
4
5
6
7
8
region
America                       74.037350
East Asia & Pacific           73.405750
Europe & Central Asia         75.656387
Middle East & North Africa    72.805333
South Asia                    68.189750
Sub-Saharan Africa            57.575080
Name: 2010, dtype: float64
1
life_by_region.mean()
Year19641965196619671968196919701971197219731974197519761977197819791980198119821983198419851986198719881989199019911992199319941995199619971998199920002001200220032004200520062007200820092010201120122013
region
America60.46277560.89052561.30512561.72022562.11062562.51055062.90750063.30472563.68002564.06007564.43912564.82055065.19437565.56372565.92670066.29397566.64222567.00542567.35667567.70115068.03847568.35617568.66995068.97720069.26312569.54305069.81670070.06955070.32845070.53990070.77570070.98995071.22110071.45900071.68152571.90292572.14737572.36985072.60725072.84045073.08225073.31527573.56487573.78675074.00562574.24102574.03735074.61527574.88300075.087350
East Asia & Pacific56.79842957.17168157.64069458.31356658.73256959.21516259.58686959.97616860.25824660.48980660.71817560.66465360.90288661.26266761.70332262.17226162.74559963.45861864.05706364.58765665.09912565.46846965.85190666.19178166.49375066.83028167.17078167.51903167.80265668.10893768.39896968.65368868.95087569.28087569.62793869.94596970.34675070.71550071.07975071.43231271.76768772.07221972.28430372.65656372.90118873.16365673.40575073.62887573.85521974.077719
Europe & Central Asia67.84011067.99173868.22791868.45529968.59539968.60292968.88006469.07964369.37063169.53996669.73513269.80346769.97619070.21329070.30820970.47422070.55646470.75090171.00335371.07675571.27776371.41789371.79245071.92852872.01099972.04758972.00642971.99315372.05719271.97437672.10633972.23035772.62108173.00539873.28157073.41937573.73162173.98954474.18406974.36540374.69549974.85398975.12826075.17804575.33735875.49129475.65638775.81383675.96103976.108157
Middle East & North Africa52.11981052.80800053.48571454.15347654.81247655.46395256.11204856.76247657.41738158.07628658.73533359.38547660.01476260.61514361.18390561.72347662.24419062.76204863.29057163.84971464.41595264.99600065.55176266.11742966.65195267.16095267.62228668.01223868.36819068.73052469.05223869.34061969.64214369.89485770.14995270.40228670.64314370.88838171.10600071.33304871.55914371.76171471.97566772.17571472.38919072.59847672.80533372.99433373.10014373.259143
South Asia43.87712544.44250044.97600045.47062545.92700046.35462546.77537547.21825047.70762548.25612548.86562549.52575050.21675050.91012551.58237552.22325052.82625053.39237553.93387554.46237554.98375055.50550056.03462556.57412557.12425057.68387558.24712558.80362559.34612559.87075060.37975060.88150061.39000061.91537562.46300063.03037563.60925064.18775064.75025065.28487565.78562566.25000066.67950067.08262567.46650067.83450068.18975068.53325068.86587569.188000
Sub-Saharan Africa43.57936043.94552044.36138044.68718045.00512045.39642046.04118046.59042047.05608047.48492047.91254048.40068048.94472049.39180049.83166050.27722050.69302051.08908051.46244051.79860052.04978052.24922052.61252052.87112053.05214053.19736053.30486053.35942053.36928053.34532052.52654053.17468053.16415053.15346053.15599053.18774053.27103053.42742053.66024053.97736054.37674054.84994055.38172055.94510056.51378057.06982057.57508057.99406058.51464058.941440
1
2
3
life_by_region[[1964, 1970, 1980, 1990, 2000, 2010, 2013]].mean().plot(rot=45)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

[Plot: mean life expectancy by region for the selected years 1964-2013]

Groupby and aggregation

Sales data

1
sales
weekdaycitybreadbutter
0SunAustin13920
1SunDallas23745
2MonAustin32670
3MonDallas45698

Review: groupby

1
sales.groupby('city')[['bread','butter']].max()
breadbutter
city
Austin32670
Dallas45698

Multiple aggregations

1
sales.groupby('city')[['bread','butter']].agg(['max','sum'])
breadbutter
maxsummaxsum
city
Austin3264657090
Dallas45669398143

Aggregation functions

  • string names
    • ‘sum’
    • ‘mean’
    • ‘count’

Custom aggregation

1
2
def data_range(series):
    return series.max() - series.min()
1
sales.groupby('weekday', observed=False)[['bread', 'butter']].agg(data_range)
breadbutter
weekday
Mon13028
Sun9825

Custom aggregation: dictionaries

1
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':data_range})
breadbutter
Alice69353
Bob3260
Dave1390

Exercises

Computing multiple aggregates of multiple columns

The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.

In this exercise, you’re going to group passengers on the Titanic by ‘pclass’ and aggregate the ‘age’ and ‘fare’ columns by the functions ‘max’ and ‘median’. You’ll then use multi-level selection to find the oldest passenger per class and the median fare price per class.

The DataFrame has been pre-loaded as titanic.

Instructions

  • Group titanic by ‘pclass’ and save the result as by_class.
  • Select the ‘age’ and ‘fare’ columns from by_class and save the result as by_class_sub.
  • Aggregate by_class_sub using ‘max’ and ‘median’. You’ll have to pass ‘max’ and ‘median’ in the form of a list to .agg().
  • Use .loc[] to print all of the rows and the column specification (‘age’,’max’). This has been done for you.
  • Use .loc[] to print all of the rows and the column specification (‘fare’,’median’).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])
print(aggregated)

# Print the maximum age in each class
print('\nMaximum Age:\n', aggregated.loc[:, ('age','max')])

# Print the median fare in each class
print('\nMedian Fare:\n', aggregated.loc[:, ('fare','median')])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
         age             fare         
         max median       max   median
pclass                                
1       80.0   39.0  512.3292  60.0000
2       70.0   29.0   73.5000  15.0458
3       74.0   24.0   69.5500   8.0500

Maximum Age:
 pclass
1    80.0
2    70.0
3    74.0
Name: (age, max), dtype: float64

Median Fare:
 pclass
1    60.0000
2    15.0458
3     8.0500
Name: (fare, median), dtype: float64

Aggregating on index levels/fields

If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.

In this exercise you’ll use the full Gapminder dataset which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013.

Your job is to create a multi-level DataFrame of the columns ‘Year’, ‘Region’ and ‘Country’. Next you’ll group the DataFrame by the ‘Year’ and ‘Region’ levels. Finally, you’ll apply a dictionary aggregation to compute the total population, spread of per capita GDP values and average child mortality rate.

The Gapminder CSV file is available as ‘gapminder.csv’.

Instructions

  • Read ‘gapminder.csv’ into a DataFrame with index_col=[‘Year’,’region’,’Country’]. Sort the index.
  • Group gapminder with a level of [‘Year’,’region’] using its level parameter. Save the result as by_year_region.
  • Define the function spread which returns the maximum and minimum of an input series. This has been done for you.
  • Create a dictionary with ‘population’:’sum’, ‘child_mortality’:’mean’ and ‘gdp’:spread as aggregator. This has been done for you.
  • Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated.
  • Print the last 6 entries of aggregated. This has been done for you, so hit ‘Submit Answer’ to view the result.
1
2
3
4
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv(gapminder_data, index_col=['Year', 'region', 'Country'])
gapminder.sort_index(inplace=True)
gapminder.head()
fertilitylifepopulationchild_mortalitygdp
YearregionCountry
1964AmericaAntigua and Barbuda4.25063.77558653.072.785008.0
Argentina3.06865.38821966478.057.438227.0
Aruba4.05967.11357031.0NaN5505.0
Bahamas4.22064.189133709.048.5618160.0
Barbados4.09462.819234455.064.705681.0
1
2
3
4
5
6
7
8
9
10
11
12
13
# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year', 'region'])

# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)
aggregated.tail(6)
populationchild_mortalitygdp
Yearregion
2013America9.629087e+0817.74583349634.0
East Asia & Pacific2.244209e+0922.285714134744.0
Europe & Central Asia8.968788e+089.83187586418.0
Middle East & North Africa4.030504e+0820.221500128676.0
South Asia1.701241e+0946.28750011469.0
Sub-Saharan Africa9.205996e+0876.94449032035.0

Grouping on a function of the index

Groupby operations can also be performed on transformations of the index values. In the case of a DateTimeIndex, we can extract portions of the datetime over which to group.

In this exercise you’ll read in a set of sample sales data from February 2015 and assign the ‘Date’ column as the index. Your job is to group the sales data by the day of the week and aggregate the sum of the ‘Units’ column.

Is there a day of the week that is more popular for customers? To find out, you’re going to use .strftime(‘%a’) to transform the index datetime values to abbreviated days of the week.

The sales data CSV file is available to you as ‘sales.csv’.

Instructions

  • Read ‘sales.csv’ into a DataFrame with index_col=’Date’ and parse_dates=True.
  • Create a groupby object with sales.index.strftime(‘%a’) as input and assign it to by_day.
  • Aggregate the ‘Units’ column of by_day with the .sum() method. Save the result as units_sum.
  • Print units_sum. This has been done for you, so hit ‘Submit Answer’ to see the result.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
sales_values = np.array([['2015-02-02 08:30:00', 'Hooli', 'Software', 3],
                         ['2015-02-02 21:00:00', 'Mediacore', 'Hardware', 9],
                         ['2015-02-03 14:00:00', 'Initech', 'Software', 13],
                         ['2015-02-04 15:30:00', 'Streeplex', 'Software', 13],
                         ['2015-02-04 22:00:00', 'Acme Coporation', 'Hardware', 14],
                         ['2015-02-05 02:00:00', 'Acme Coporation', 'Software', 19],
                         ['2015-02-05 22:00:00', 'Hooli', 'Service', 10],
                         ['2015-02-07 23:00:00', 'Acme Coporation', 'Hardware', 1],
                         ['2015-02-09 09:00:00', 'Streeplex', 'Service', 19],
                         ['2015-02-09 13:00:00', 'Mediacore', 'Software', 7],
                         ['2015-02-11 20:00:00', 'Initech', 'Software', 7],
                         ['2015-02-11 23:00:00', 'Hooli', 'Software', 4],
                         ['2015-02-16 12:00:00', 'Hooli', 'Software', 10],
                         ['2015-02-19 11:00:00', 'Mediacore', 'Hardware', 16],
                         ['2015-02-19 16:00:00', 'Mediacore', 'Service', 10],
                         ['2015-02-21 05:00:00', 'Mediacore', 'Software', 3],
                         ['2015-02-21 20:30:00', 'Hooli', 'Hardware', 3],
                         ['2015-02-25 00:30:00', 'Initech', 'Service', 10],
                         ['2015-02-26 09:00:00', 'Streeplex', 'Service', 4]])
sales_cols = ['Date', 'Company', 'Product', 'Units']
1
2
3
4
5
sales = pd.DataFrame(sales_values, columns=sales_cols)
sales['Date'] = pd.to_datetime(sales['Date'])
sales.set_index('Date', inplace=True)
sales['Units'] = sales['Units'].astype('int64')
sales.head()
CompanyProductUnits
Date
2015-02-02 08:30:00HooliSoftware3
2015-02-02 21:00:00MediacoreHardware9
2015-02-03 14:00:00InitechSoftware13
2015-02-04 15:30:00StreeplexSoftware13
2015-02-04 22:00:00Acme CoporationHardware14
1
sales.info()
1
2
3
4
5
6
7
8
9
10
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 19 entries, 2015-02-02 08:30:00 to 2015-02-26 09:00:00
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Company  19 non-null     object
 1   Product  19 non-null     object
 2   Units    19 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 608.0+ bytes
1
2
3
4
5
6
# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
units_sum = by_day['Units'].sum()
units_sum
1
2
3
4
5
6
7
Date
Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64

Groupby and transformation

The z-score

1
2
def zscore_def(series):
    return (series - series.mean()) / series.std()

The automobile dataset

1
2
auto = pd.read_csv(auto_mpg)
auto.head()
mpgcyldisplhpweightaccelyroriginname
018.08307.0130350412.070USchevrolet chevelle malibu
115.08350.0165369311.570USbuick skylark 320
218.08318.0150343611.070USplymouth satellite
316.08304.0150343312.070USamc rebel sst
417.08302.0140344910.570USford torino

MPG z-score

1
zscore_def(auto['mpg']).head()
1
2
3
4
5
6
0   -0.697747
1   -1.082115
2   -0.697747
3   -0.953992
4   -0.825870
Name: mpg, dtype: float64
1
2
# imported from scipy at the top
zscore(auto['mpg'])[:5]
1
2
3
4
5
6
0   -0.698638
1   -1.083498
2   -0.698638
3   -0.955212
4   -0.826925
Name: mpg, dtype: float64

MPG z-score by year

1
auto.groupby('yr')['mpg'].transform(zscore_def).head()
1
2
3
4
5
6
0    0.058125
1   -0.503753
2    0.058125
3   -0.316460
4   -0.129168
Name: mpg, dtype: float64

Apply transformation and aggregation

1
2
3
4
5
6
def zscore_with_year_and_name(group):
    df = pd.DataFrame(
        {'mpg': zscore(group['mpg']),
         'year': group['yr'],
         'name': group['name']})
    return df
1
auto.groupby('yr').apply(zscore_with_year_and_name).head()
mpgyearname
yr
7000.05915470chevrolet chevelle malibu
1-0.51267070buick skylark 320
20.05915470plymouth satellite
3-0.32206270amc rebel sst
4-0.13145470ford torino

Exercises

Detecting outliers with Z-Scores

As Dhavide demonstrated in the video using the zscore function, you can apply a .transform() method after grouping to apply a function to groups of data independently. The z-score is also useful for finding outliers: an observation with a z-score beyond +/- 3 is generally considered to be an outlier.

In this example, you’re going to normalize the Gapminder data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, you will filter out countries that have high fertility rates and low life expectancy for their region.

The Gapminder DataFrame for 2010 indexed by ‘Country’ is provided for you as gapminder_2010.

Instructions

  • Import zscore from scipy.stats.
  • Group gapminder_2010 by ‘region’ and transform the [‘life’,’fertility’] columns by zscore.
  • Construct a boolean Series of the bitwise or between standardized[‘life’] < -3 and standardized[‘fertility’] > 3.
  • Filter gapminder_2010 using .loc[] and the outliers Boolean Series. Save the result as gm_outliers.
  • Print gm_outliers. This has been done for you, so hit ‘Submit Answer’ to see the results.
1
2
3
4
5
gapminder = pd.read_csv(gapminder_data, index_col='Country')
gapminder_mask = gapminder['Year'] == 2010
gapminder_2010 = gapminder[gapminder_mask].copy()
gapminder_2010.drop('Year', axis=1, inplace=True)
gapminder_2010.head()
fertilitylifepopulationchild_mortalitygdpregion
Country
Afghanistan5.65959.61231411743.0105.01637.0South Asia
Albania1.74176.7803204284.016.69374.0Europe & Central Asia
Algeria2.81770.61535468208.027.412494.0Middle East & North Africa
Angola6.21850.68919081912.0182.57047.0Sub-Saharan Africa
Antigua and Barbuda2.13075.43788710.09.920567.0America
1
2
3
# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)
standardized.head()
lifefertility
Country
Afghanistan-1.7436012.504732
Albania0.2263670.010964
Algeria-0.440196-0.003972
Angola-0.8825371.095653
Antigua and Barbuda0.240607-0.363761
1
2
3
4
5
6
# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]
gm_outliers
fertilitylifepopulationchild_mortalitygdpregion
Country
Guatemala3.97471.10014388929.034.56849.0America
Haiti3.35045.0009993247.0208.81518.0America
Tajikistan3.78066.8306878637.052.62110.0Europe & Central Asia
Timor-Leste6.23765.9521124355.063.81777.0East Asia & Pacific

Filling missing data (imputation) by group

Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). In Chapter 1, you practiced using the .dropna() method to drop missing values. Now, you will practice imputing missing values. You can use .groupby() and .transform() to fill missing data appropriately for each group.

Your job is to fill in missing ‘age’ values for passengers on the Titanic with the median age from their ‘gender’ and ‘pclass’. To do this, you’ll group by the ‘sex’ and ‘pclass’ columns and transform each group with a custom function to call .fillna() and impute the median value.

The DataFrame has been pre-loaded as titanic. Explore it in the IPython Shell by printing the output of titanic.tail(10). Notice in particular the NaNs in the ‘age’ column.

Instructions

  • Group titanic by ‘sex’ and ‘pclass’. Save the result as by_sex_class.
  • Write a function called impute_median() that fills missing values with the median of a series. This has been done for you.
  • Call .transform() with impute_median on the ‘age’ column of by_sex_class.
  • Print the output of titanic.tail(10). This has been done for you - hit ‘Submit Answer’ to see how the missing values have now been imputed.
1
2
titanic = pd.read_csv(titanic_data)
titanic.head()
pclasssurvivednamesexagesibspparchticketfarecabinembarkedboatbodyhome.dest
011Allen, Miss. Elisabeth Waltonfemale29.000024160211.3375B5S2NaNSt Louis, MO
111Allison, Master. Hudson Trevormale0.9212113781151.5500C22 C26S11NaNMontreal, PQ / Chesterville, ON
210Allison, Miss. Helen Lorainefemale2.0012113781151.5500C22 C26SNaNNaNMontreal, PQ / Chesterville, ON
310Allison, Mr. Hudson Joshua Creightonmale30.0012113781151.5500C22 C26SNaN135.0Montreal, PQ / Chesterville, ON
410Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25.0012113781151.5500C22 C26SNaNNaNMontreal, PQ / Chesterville, ON
1
2
3
4
5
6
7
8
9
10
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex', 'pclass'])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic.age = by_sex_class['age'].transform(impute_median)
titanic.tail(10)
pclasssurvivednamesexagesibspparchticketfarecabinembarkedboatbodyhome.dest
129930Yasbeck, Mr. Antonimale27.010265914.4542NaNCCNaNNaN
130031Yasbeck, Mrs. Antoni (Selini Alexander)female15.010265914.4542NaNCNaNNaNNaN
130130Youseff, Mr. Geriousmale45.50026287.2250NaNCNaN312.0NaN
130230Yousif, Mr. Wazlimale25.00026477.2250NaNCNaNNaNNaN
130330Yousseff, Mr. Geriousmale25.000262714.4583NaNCNaNNaNNaN
130430Zabour, Miss. Hilenifemale14.510266514.4542NaNCNaN328.0NaN
130530Zabour, Miss. Thaminefemale22.010266514.4542NaNCNaNNaNNaN
130630Zakarian, Mr. Mapriededermale26.50026567.2250NaNCNaN304.0NaN
130730Zakarian, Mr. Ortinmale27.00026707.2250NaNCNaNNaNNaN
130830Zimmerman, Mr. Leomale29.0003150827.8750NaNSNaNNaNNaN

Other transformations with .apply

The .apply() method, when used on a groupby object, applies an arbitrary function to each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way.

In this exercise, you’re going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you’ll define a function to compute the aggregate spread of per capita GDP in each region and the individual country’s z-score of the regional per capita GDP. You’ll then select three countries - the United States, the United Kingdom and China - to see a summary of the regional GDP and that country’s z-score against the regional mean.

The 2010 Gapminder DataFrame is provided for you as gapminder_2010. Pandas has been imported as pd.

The following function has been defined for your use:

1
2
3
4
5
6
7
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})

Instructions

  • Group gapminder_2010 by ‘region’. Save the result as regional.
  • Apply the provided disparity function on regional, and save the result as reg_disp.
  • Use .loc[] to select [‘United States’,’United Kingdom’,’China’] from reg_disp and print the results.
1
2
3
4
5
6
7
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})
1
gapminder_2010.head()
fertilitylifepopulationchild_mortalitygdpregion
Country
Afghanistan5.65959.61231411743.0105.01637.0South Asia
Albania1.74176.7803204284.016.69374.0Europe & Central Asia
Algeria2.81770.61535468208.027.412494.0Middle East & North Africa
Angola6.21850.68919081912.0182.57047.0Sub-Saharan Africa
Antigua and Barbuda2.13075.43788710.09.920567.0America
1
2
3
4
5
6
7
8
9
10
11
12
13
# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby('region')

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Get the 'Country' level values from the DataFrame's index
countries = reg_disp.index.get_level_values('Country')

# Keep only the rows for the three countries of interest
reg_disp = reg_disp.loc[countries.isin(['United States', 'United Kingdom', 'China'])]

reg_disp
z(gdp)regional spread(gdp)
regionCountry
AmericaUnited States3.01337447855.0
East Asia & PacificChina-0.43275696993.0
Europe & Central AsiaUnited Kingdom0.57287389037.0

Groupby and filtering

The automobile dataset

1
auto.head()
mpgcyldisplhpweightaccelyroriginname
018.08307.0130350412.070USchevrolet chevelle malibu
115.08350.0165369311.570USbuick skylark 320
218.08318.0150343611.070USplymouth satellite
316.08304.0150343312.070USamc rebel sst
417.08302.0140344910.570USford torino

Mean MPG by year

1
auto.groupby('yr')['mpg'].mean()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
yr
70    17.689655
71    21.111111
72    18.714286
73    17.100000
74    22.769231
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.803704
81    30.185714
82    32.000000
Name: mpg, dtype: float64

groupby object

1
2
splitting = auto.groupby('yr')
type(splitting)
1
pandas.core.groupby.generic.DataFrameGroupBy
1
type(splitting.groups)
1
pandas.io.formats.printing.PrettyDict
1
print(splitting.groups.keys())
1
dict_keys([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82])

groupby object: iteration

1
2
3
for group_name, group in splitting:
    avg = group['mpg'].mean()
    print(group_name, avg)
1
2
3
4
5
6
7
8
9
10
11
12
13
70 17.689655172413794
71 21.11111111111111
72 18.714285714285715
73 17.1
74 22.76923076923077
75 20.266666666666666
76 21.573529411764707
77 23.375
78 24.061111111111114
79 25.09310344827585
80 33.803703703703704
81 30.18571428571429
82 32.0

groupby object: iteration and filtering

1
2
3
for group_name, group in splitting:
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)
1
2
3
4
5
6
7
8
9
10
11
12
13
70 15.666666666666666
71 20.25
72 15.333333333333334
73 14.833333333333334
74 18.666666666666668
75 17.666666666666668
76 23.25
77 20.25
78 23.233333333333334
79 21.666666666666668
80 30.05
81 23.5
82 29.0

groupby object: comprehension

1
2
chevy_means = {year:group.loc[group['name'].str.contains('chevrolet'),'mpg'].mean() for year,group in splitting}
pd.Series(chevy_means)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
70    15.666667
71    20.250000
72    15.333333
73    14.833333
74    18.666667
75    17.666667
76    23.250000
77    20.250000
78    23.233333
79    21.666667
80    30.050000
81    23.500000
82    29.000000
dtype: float64

Boolean groupby

1
2
chevy = auto['name'].str.contains('chevrolet')
auto.groupby(['yr', chevy])['mpg'].mean()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
yr  name 
70  False    17.923077
    True     15.666667
71  False    21.260870
    True     20.250000
72  False    19.120000
    True     15.333333
73  False    17.500000
    True     14.833333
74  False    23.304348
    True     18.666667
75  False    20.555556
    True     17.666667
76  False    21.350000
    True     23.250000
77  False    23.895833
    True     20.250000
78  False    24.136364
    True     23.233333
79  False    25.488462
    True     21.666667
80  False    34.104000
    True     30.050000
81  False    30.433333
    True     23.500000
82  False    32.461538
    True     29.000000
Name: mpg, dtype: float64

Exercises

Grouping and filtering with .apply()

By using .apply(), you can write functions that filter rows within groups. The .apply() method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.

In this exercise you’ll take the Titanic data set and analyze survival rates from the ‘C’ deck, which contained the most passengers. To do this you’ll group the dataset by ‘sex’ and then use the .apply() method on a provided user defined function which calculates the mean survival rates on the ‘C’ deck:

1
2
3
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()

The DataFrame has been pre-loaded as titanic.

Instructions

  • Group titanic by ‘sex’. Save the result as by_sex.
  • Apply the provided c_deck_survival function on the by_sex DataFrame. Save the result as c_surv_by_sex.
  • Print c_surv_by_sex.
1
2
titanic = pd.read_csv(titanic_data)
titanic.head(3)
pclasssurvivednamesexagesibspparchticketfarecabinembarkedboatbodyhome.dest
011Allen, Miss. Elisabeth Waltonfemale29.000024160211.3375B5S2NaNSt Louis, MO
111Allison, Master. Hudson Trevormale0.9212113781151.5500C22 C26S11NaNMontreal, PQ / Chesterville, ON
210Allison, Miss. Helen Lorainefemale2.0012113781151.5500C22 C26SNaNNaNMontreal, PQ / Chesterville, ON
1
2
3
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
1
2
3
4
5
6
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby('sex')

# Call by_sex.apply with the function c_deck_survival
c_surv_by_sex = by_sex.apply(c_deck_survival)
c_surv_by_sex.head()
1
2
3
4
sex
female    0.913043
male      0.312500
dtype: float64

Grouping and filtering with .filter()

You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.

In this exercise, you’ll take the February sales data and remove entries from companies that purchased less than or equal to 35 Units in the whole month.

First, you’ll identify how many units each company bought for verification. Next you’ll use the .filter() method after grouping by ‘Company’ to remove all rows belonging to companies whose sum over the ‘Units’ column was less than or equal to 35. Finally, verify that the three companies whose total Units purchased were less than or equal to 35 have been filtered out from the DataFrame.

Instructions

  • Group sales by ‘Company’. Save the result as by_company.
  • Compute and print the sum of the ‘Units’ column of by_company.
  • Call .filter() on by_company with lambda g:g[‘Units’].sum() > 35 as input and print the result.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
sales_values = np.array([['2015-02-02 08:30:00', 'Hooli', 'Software', 3],
                         ['2015-02-02 21:00:00', 'Mediacore', 'Hardware', 9],
                         ['2015-02-03 14:00:00', 'Initech', 'Software', 13],
                         ['2015-02-04 15:30:00', 'Streeplex', 'Software', 13],
                         ['2015-02-04 22:00:00', 'Acme Coporation', 'Hardware', 14],
                         ['2015-02-05 02:00:00', 'Acme Coporation', 'Software', 19],
                         ['2015-02-05 22:00:00', 'Hooli', 'Service', 10],
                         ['2015-02-07 23:00:00', 'Acme Coporation', 'Hardware', 1],
                         ['2015-02-09 09:00:00', 'Streeplex', 'Service', 19],
                         ['2015-02-09 13:00:00', 'Mediacore', 'Software', 7],
                         ['2015-02-11 20:00:00', 'Initech', 'Software', 7],
                         ['2015-02-11 23:00:00', 'Hooli', 'Software', 4],
                         ['2015-02-16 12:00:00', 'Hooli', 'Software', 10],
                         ['2015-02-19 11:00:00', 'Mediacore', 'Hardware', 16],
                         ['2015-02-19 16:00:00', 'Mediacore', 'Service', 10],
                         ['2015-02-21 05:00:00', 'Mediacore', 'Software', 3],
                         ['2015-02-21 20:30:00', 'Hooli', 'Hardware', 3],
                         ['2015-02-25 00:30:00', 'Initech', 'Service', 10],
                         ['2015-02-26 09:00:00', 'Streeplex', 'Service', 4]])
sales_cols = ['Date', 'Company', 'Product', 'Units']
1
2
3
4
5
sales = pd.DataFrame(sales_values, columns=sales_cols)
sales['Date'] = pd.to_datetime(sales['Date'])
sales.set_index('Date', inplace=True)
sales['Units'] = sales['Units'].astype('int64')
sales.head()
CompanyProductUnits
Date
2015-02-02 08:30:00HooliSoftware3
2015-02-02 21:00:00MediacoreHardware9
2015-02-03 14:00:00InitechSoftware13
2015-02-04 15:30:00StreeplexSoftware13
2015-02-04 22:00:00Acme CoporationHardware14
1
2
3
4
5
6
# Group sales by 'Company': by_company
by_company = sales.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
by_com_sum
1
2
3
4
5
6
7
Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
by_com_filt
| Date                | Company   | Product  | Units |
|---------------------|-----------|----------|-------|
| 2015-02-02 21:00:00 | Mediacore | Hardware | 9     |
| 2015-02-04 15:30:00 | Streeplex | Software | 13    |
| 2015-02-09 09:00:00 | Streeplex | Service  | 19    |
| 2015-02-09 13:00:00 | Mediacore | Software | 7     |
| 2015-02-19 11:00:00 | Mediacore | Hardware | 16    |
| 2015-02-19 16:00:00 | Mediacore | Service  | 10    |
| 2015-02-21 05:00:00 | Mediacore | Software | 3     |
| 2015-02-26 09:00:00 | Streeplex | Service  | 4     |

Filtering and grouping with .map()

You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function/transformation of a column. The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.
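For instance, a minimal sketch on made-up data (the column names here are hypothetical): any Series aligned with the DataFrame's index can serve as the grouping key.

```python
import pandas as pd

# Hypothetical toy data to illustrate grouping by a transformed column
df_toy = pd.DataFrame({'score': [40, 85, 62, 90],
                       'passed': [0, 1, 1, 1]})

# Map a condition on 'score' to labels, then group by that Series
band = (df_toy['score'] >= 60).map({True: 'high', False: 'low'})
print(df_toy.groupby(band)['passed'].mean())
# high    1.0
# low     0.0
```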

In this exercise, your job is to investigate survival rates of passengers on the Titanic by ‘age’ and ‘pclass’. In particular, the goal is to find out what fraction of children under 10 survived in each ‘pclass’. You’ll do this by first creating a Boolean Series that is True for passengers under 10 years old and False otherwise, then using .map() to convert these values to strings.

Finally, you’ll group by the under 10 series and the ‘pclass’ column and aggregate the ‘survived’ column. The ‘survived’ column has the value 1 if the passenger survived and 0 otherwise. The mean of the ‘survived’ column is the fraction of passengers who lived.

The DataFrame has been pre-loaded for you as titanic.

Instructions

  • Create a Boolean Series of titanic[‘age’] < 10 and call .map with {True:’under 10’, False:’over 10’}.
  • Group titanic by the under10 Series and then compute and print the mean of the ‘survived’ column.
  • Group titanic by the under10 Series as well as the ‘pclass’ column and then compute and print the mean of the ‘survived’ column.
titanic.head(3)
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})
under10.head()
0     over 10
1    under 10
2    under 10
3     over 10
4     over 10
Name: age, dtype: object
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
survived_mean_1
age
over 10     0.366748
under 10    0.609756
Name: survived, dtype: float64
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
survived_mean_2
age       pclass
over 10   1         0.617555
          2         0.380392
          3         0.238897
under 10  1         0.750000
          2         1.000000
          3         0.446429
Name: survived, dtype: float64

Bringing it all together

Here, you will bring together everything you have learned in this course while working with data recorded from the Summer Olympic games that goes as far back as 1896! This is a rich dataset that will allow you to fully apply the data manipulation techniques you have learned. You will pivot, unstack, group, slice, and reshape your data as you explore this dataset and uncover some truly fascinating insights. Enjoy!

Case Study - Summer Olympics

Olympic medals dataset

medals = pd.read_csv(medals_data)
medals.head()
|   | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
| 3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
| 4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |

Reminder: indexing & pivoting

  • Filtering and indexing
    • One-level indexing
    • Multi-level indexing
  • Reshaping DataFrames with pivot()
  • pivot_table()

Reminder: groupby

  • Useful DataFrame methods
    • unique()
    • value_counts()
  • Aggregations, transformations, filtering

Case Study Explorations

Grouping and aggregating

The Olympic medal data for the following exercises comes from The Guardian. It comprises records of all events held at the Olympic games between 1896 and 2012.

Suppose you have loaded the data into a DataFrame medals. You now want to find the total number of medals awarded to the USA per edition. To do this, filter the ‘USA’ rows and use the groupby() function to put the ‘Edition’ column on the index:

USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')

Given the goal of finding the total number of USA medals awarded per edition, what column should you select and which aggregation method should you use?

Instructions

Possible Answers

USA_edition_grouped['City'].mean()
USA_edition_grouped['Athlete'].sum()
USA_edition_grouped['Medal'].count()
USA_edition_grouped['Gender'].first()
USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')
USA_edition_grouped['Medal'].count().head()
Edition
1896     20
1900     55
1904    394
1908     63
1912    101
Name: Medal, dtype: int64

Using .value_counts() for ranking

For this exercise, you will use the pandas Series method .value_counts() to determine the top 15 countries ranked by total number of medals.

Notice that .value_counts() sorts by count in descending order by default. The result is returned as a Series of counts indexed by the unique entries of the original Series.
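For example, on a small hypothetical Series:

```python
import pandas as pd

s = pd.Series(['USA', 'FRA', 'USA', 'GBR', 'USA', 'FRA'])
print(s.value_counts())
# USA    3
# FRA    2
# GBR    1
```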

The DataFrame has been pre-loaded for you as medals.

Instructions

  • Extract the ‘NOC’ column from the DataFrame medals and assign the result to country_names. Notice that this Series has repeated entries for every medal (of any type) a country has won in any Edition of the Olympics.
  • Create a Series medal_counts by applying .value_counts() to the Series country_names.
  • Print the top 15 countries ranked by total number of medals won. This has been done for you, so hit ‘Submit Answer’ to see the result.
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
country_names.head()
0    HUN
1    AUT
2    GRE
3    GRE
4    GRE
Name: NOC, dtype: object
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))
NOC
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
GER    1211
AUS    1075
HUN    1053
SWE    1021
GDR     825
NED     782
JPN     704
CHN     679
RUS     638
ROU     624
Name: count, dtype: int64

Using .pivot_table() to count medals by type

Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail. You can use a pivot table to compute how many separate bronze, silver and gold medals each country won. That pivot table can then be used to repeat the previous computation to rank by total medals won.

In this exercise, you will use .pivot_table() first to aggregate the total medals by type. Then, you can use .sum() along the columns of the pivot table to produce a new column. When the modified pivot table is sorted by the total medals column, you can display the results from the last exercise with a bit more detail.
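The same pattern on a tiny made-up example (the column names here are hypothetical), just to show the pivot/count/sum steps in miniature:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B', 'B'],
                    'medal':   ['Gold', 'Silver', 'Gold', 'Gold', 'Bronze'],
                    'athlete': ['a1', 'a2', 'b1', 'b2', 'b3']})

# Count medals by type per country, then total them along the columns
tab = toy.pivot_table(index='country', columns='medal',
                      values='athlete', aggfunc='count')
tab['totals'] = tab.sum(axis='columns')
print(tab.sort_values('totals', ascending=False))
```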

Instructions

  • Construct a pivot table counted from the DataFrame medals aggregating by count. Use ‘NOC’ as the index, ‘Athlete’ for the values, and ‘Medal’ for the columns.
  • Modify the DataFrame counted by adding a column counted[‘totals’]. The new column ‘totals’ should contain the result of taking the sum along the columns (i.e., use .sum(axis=’columns’)).
  • Overwrite the DataFrame counted by sorting it with the .sort_values() method. Specify the keyword argument ascending=False.
  • Print the first 15 rows of counted using .head(15). This has been done for you, so hit ‘Submit Answer’ to see the result.
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC',
                             columns='Medal',
                             values='Athlete',
                             aggfunc='count')
counted.head()
| NOC | Bronze | Gold | Silver |
|---|---|---|---|
| AFG | 1.0 | NaN | NaN |
| AHO | NaN | NaN | 1.0 |
| ALG | 8.0 | 4.0 | 2.0 |
| ANZ | 5.0 | 20.0 | 4.0 |
| ARG | 88.0 | 68.0 | 83.0 |
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
counted.head()
| NOC | Bronze | Gold | Silver | totals |
|---|---|---|---|---|
| AFG | 1.0 | NaN | NaN | 1.0 |
| AHO | NaN | NaN | 1.0 | 1.0 |
| ALG | 8.0 | 4.0 | 2.0 | 14.0 |
| ANZ | 5.0 | 20.0 | 4.0 | 29.0 |
| ARG | 88.0 | 68.0 | 83.0 | 239.0 |
# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)
counted.head()
| NOC | Bronze | Gold | Silver | totals |
|---|---|---|---|---|
| USA | 1052.0 | 2088.0 | 1195.0 | 4335.0 |
| URS | 584.0 | 838.0 | 627.0 | 2049.0 |
| GBR | 505.0 | 498.0 | 591.0 | 1594.0 |
| FRA | 475.0 | 378.0 | 461.0 | 1314.0 |
| ITA | 374.0 | 460.0 | 394.0 | 1228.0 |

Understanding the column labels

‘Gender’ and ‘Event_gender’

medals.loc[145:154, ['NOC', 'Gender', 'Event', 'Event_gender', 'Medal']]
|     | NOC | Gender | Event | Event_gender | Medal |
|-----|---|---|---|---|---|
| 145 | GRE | Men | heavyweight - two hand lift | M | Bronze |
| 146 | DEN | Men | heavyweight - two hand lift | M | Gold |
| 147 | GBR | Men | heavyweight - two hand lift | M | Silver |
| 148 | GRE | Men | open event | M | Bronze |
| 149 | GER | Men | open event | M | Gold |
| 150 | GRE | Men | open event | M | Silver |
| 151 | HUN | Men | 1500m freestyle | M | Bronze |
| 152 | GBR | Men | 1500m freestyle | M | Gold |
| 153 | AUT | Men | 1500m freestyle | M | Silver |
| 154 | NED | Men | 200m backstroke | M | Bronze |

Reminder: slicing & filtering

  • Indexing and slicing
    • .loc[] and .iloc[] accessors
  • Filtering
    • Selecting by Boolean Series
    • Filtering null/non-null and zero/non-zero values

Reminder: Handling categorical data

  • Useful DataFrame methods for handling categorical data:
    • value_counts()
    • unique()
    • groupby()
  • groupby() aggregations:
    • mean(), std(), count()

Case Study Explorations

Applying .drop_duplicates()

What could be the difference between the ‘Event_gender’ and ‘Gender’ columns? You should be able to evaluate your guess by looking at the unique values of the pairs (Event_gender, Gender) in the data. In particular, you should not see something like (Event_gender=’M’, Gender=’Women’). However, you will see that, strangely enough, there is an observation with (Event_gender=’W’, Gender=’Men’).

The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations. The DataFrame has been loaded as medals.
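As a minimal sketch on made-up values, .drop_duplicates() keeps the first occurrence of each distinct row:

```python
import pandas as pd

pairs = pd.DataFrame({'Event_gender': ['M', 'M', 'W', 'W', 'X'],
                      'Gender':       ['Men', 'Men', 'Women', 'Women', 'Men']})
print(pairs.drop_duplicates())
#   Event_gender Gender
# 0            M    Men
# 2            W  Women
# 4            X    Men
```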

Instructions

  • Select the columns ‘Event_gender’ and ‘Gender’.
  • Create a dataframe ev_gen_uniques containing the unique pairs contained in ev_gen.
  • Print ev_gen_uniques. This has been done for you, so hit ‘Submit Answer’ to see the result.
medals = pd.read_csv(medals_data)
medals.head()
|   | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
| 3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
| 4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
# Select columns: ev_gen
ev_gen = medals[['Event_gender', 'Gender']]
ev_gen.head()
|   | Event_gender | Gender |
|---|---|---|
| 0 | M | Men |
| 1 | M | Men |
| 2 | M | Men |
| 3 | M | Men |
| 4 | M | Men |
# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
ev_gen_uniques
|       | Event_gender | Gender |
|-------|---|---|
| 0     | M | Men |
| 348   | X | Men |
| 416   | W | Women |
| 639   | X | Women |
| 23675 | W | Men |

Finding possible errors with .groupby()

You will now use .groupby() to continue your exploration. Your job is to group by ‘Event_gender’ and ‘Gender’ and count the rows.

You will see that there is only one suspicious row; this is likely a data error.

The DataFrame is available to you as medals.

Instructions

  • Group medals by ‘Event_gender’ and ‘Gender’.
  • Create a medal_count_by_gender DataFrame with a group count using the .count() method.
  • Print medal_count_by_gender. This has been done for you, so hit ‘Submit Answer’ to view the result.
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])

# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()
medal_count_by_gender
| Event_gender | Gender | City | Edition | Sport | Discipline | Athlete | NOC | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|
| M | Men | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 | 20067 |
| W | Men | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| W | Women | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 | 7277 |
| X | Men | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 | 1653 |
| X | Women | 218 | 218 | 218 | 218 | 218 | 218 | 218 | 218 |

Locating suspicious data

You will now inspect the suspect record by locating the offending row.

You will see that, according to the data, Joyce Chepchumba was a man who won a medal in a women’s event. That is a data error, as you can confirm with a web search.

Instructions

  • Create a Boolean Series with a condition that captures the only row that has medals.Event_gender == ‘W’ and medals.Gender == ‘Men’. Be sure to use the & operator.
  • Use the Boolean Series to create a DataFrame called suspect with the suspicious row.
  • Print suspect. This has been done for you, so hit ‘Submit Answer’ to see the result.
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')

# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus]
suspect
|       | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|-------|---|---|---|---|---|---|---|---|---|---|
| 23675 | Sydney | 2000 | Athletics | Athletics | CHEPCHUMBA, Joyce | KEN | Men | marathon | W | Bronze |

Constructing alternative country rankings

Counting distinct events

sports = medals['Sport'].unique()
sports
array(['Aquatics', 'Athletics', 'Cycling', 'Fencing', 'Gymnastics',
       'Shooting', 'Tennis', 'Weightlifting', 'Wrestling', 'Archery',
       'Basque Pelota', 'Cricket', 'Croquet', 'Equestrian', 'Football',
       'Golf', 'Polo', 'Rowing', 'Rugby', 'Sailing', 'Tug of War',
       'Boxing', 'Lacrosse', 'Roque', 'Hockey', 'Jeu de paume', 'Rackets',
       'Skating', 'Water Motorsports', 'Modern Pentathlon', 'Ice Hockey',
       'Basketball', 'Canoe / Kayak', 'Handball', 'Judo', 'Volleyball',
       'Table Tennis', 'Badminton', 'Baseball', 'Softball', 'Taekwondo',
       'Triathlon'], dtype=object)

Ranking of distinct events

  • Top five countries that have won medals in the most sports
  • Compare medal counts of USA and USSR from 1952 to 1988

Two new DataFrame methods

  • idxmax(): Row or column label where maximum value is located
  • idxmin(): Row or column label where minimum value is located

idxmax() Example

weather = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
                        'Mean TemperatureF': [32.354839, 28.714286, 35.000000, 53.100000, 62.612903, 70.133333,
                                              72.870968, 70.000000, 63.766667, 55.451613, 39.800000, 34.935484]})
weather.set_index('Month', inplace=True)
weather
| Month | Mean TemperatureF |
|-------|-------------------|
| Jan   | 32.354839 |
| Feb   | 28.714286 |
| Mar   | 35.000000 |
| Apr   | 53.100000 |
| May   | 62.612903 |
| Jun   | 70.133333 |
| Jul   | 72.870968 |
| Aug   | 70.000000 |
| Sep   | 63.766667 |
| Oct   | 55.451613 |
| Nov   | 39.800000 |
| Dec   | 34.935484 |

Using idxmax()

weather.idxmax()  # Returns month of highest temperature
Mean TemperatureF    Jul
dtype: object

Using idxmax() along columns

weather.T  # Returns DataFrame with single row, 12 columns
|                   | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|-------------------|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean TemperatureF | 32.354839 | 28.714286 | 35.0 | 53.1 | 62.612903 | 70.133333 | 72.870968 | 70.0 | 63.766667 | 55.451613 | 39.8 | 34.935484 |
weather.T.idxmax(axis='columns')
Mean TemperatureF    Jul
dtype: object

Using idxmin()

weather.T.idxmin(axis='columns')
Mean TemperatureF    Feb
dtype: object

Case Study Explorations

Using .nunique() to rank by distinct sports

You may want to know which countries won medals in the most distinct sports. The .nunique() method is the principal aggregation here. Given a categorical Series S, S.nunique() returns the number of distinct categories.
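A tiny hypothetical example of .nunique() as a groupby aggregation:

```python
import pandas as pd

toy = pd.DataFrame({'NOC':   ['A', 'A', 'A', 'B', 'B'],
                    'Sport': ['Swimming', 'Swimming', 'Rowing', 'Judo', 'Judo']})

# Number of distinct sports per country
print(toy.groupby('NOC')['Sport'].nunique())
# A    2
# B    1
```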

Instructions

  • Group medals by ‘NOC’.
  • Compute the number of distinct sports in which each country won medals. To do this, select the ‘Sport’ column from country_grouped and apply .nunique().
  • Sort Nsports in descending order with .sort_values() and ascending=False.
  • Print the first 15 rows of Nsports. This has been done for you, so hit ‘Submit Answer’ to see the result.
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')

# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()

# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)

# Print the top 15 rows of Nsports
print(Nsports.head(15))
NOC
USA    34
GBR    31
FRA    28
GER    26
CHN    24
CAN    22
AUS    22
ESP    22
SWE    21
ITA    21
URS    21
NED    20
RUS    20
JPN    20
TCH    19
Name: Sport, dtype: int64

The USSR is not in the top 5 in this category, while the USA continues to remain on top. What could be the cause of this? You’ll compare the medal counts of USA vs. USSR more closely in the next two exercises to find out.

Counting USA vs. USSR Cold War Olympic Sports

The Olympic competitions between 1952 and 1988 took place during the height of the Cold War between the United States of America (USA) & the Union of Soviet Socialist Republics (USSR). Your goal in this exercise is to aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years.

The construction is mostly the same as in the preceding exercise. There is an additional filtering stage beforehand in which you reduce the original DataFrame medals by extracting data from the Cold War period that applies only to the US or to the USSR. The relevant country codes in the DataFrame, which has been pre-loaded as medals, are ‘USA’ & ‘URS’.
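The filtering step combines two Boolean Series; a minimal sketch of the .isin() part on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'NOC': ['USA', 'FRA', 'URS', 'GBR'],
                    'Edition': [1980, 1980, 1984, 1996]})

# True where 'NOC' is one of the listed codes
mask = toy['NOC'].isin(['USA', 'URS'])
print(toy[mask & (toy['Edition'] <= 1988)])  # keeps the USA/1980 and URS/1984 rows
```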

Instructions

  • Using medals, create a Boolean Series called during_cold_war that is True when ‘Edition’ is between 1952 and 1988 (inclusive).
  • Using medals, create a Boolean Series called is_usa_urs that is True when ‘NOC’ is either ‘USA’ or ‘URS’.
  • Filter the medals DataFrame using during_cold_war and is_usa_urs to create a new DataFrame called cold_war_medals.
  • Group cold_war_medals by ‘NOC’.
  • Create a Series Nsports from country_grouped using indexing & chained methods:
    • Extract the column ‘Sport’.
    • Use .nunique() to get the number of unique elements in each group;
    • Apply .sort_values(ascending=False) to rearrange the Series.
  • Print the final Series Nsports. This has been done for you, so hit ‘Submit Answer’ to see the result!
# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (1952 <= medals.Edition) & (medals.Edition <= 1988)

# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)
Nsports
NOC
URS    21
USA    20
Name: Sport, dtype: int64

As you can see, the USSR is actually higher than the US when you look only at the Olympic competitions between 1952 and 1988.

Counting USA vs. USSR Cold War Olympic Medals

For this exercise, you want to see which country, the USA or the USSR, won the most medals consistently over the Cold War period.

There are several steps involved in carrying out this computation.

  • You’ll need a pivot table with years (‘Edition’) on the index and countries (‘NOC’) on the columns. The entries will be the total number of medals each country won that year. If the country won no medals in a given edition, expect a NaN in that entry of the pivot table.
  • You’ll need to slice the Cold War period and subset the ‘USA’ and ‘URS’ columns.
  • You’ll need to make a Series from this slice of the pivot table that tells which country won the most medals in that edition, using .idxmax(axis=’columns’). Where .max() returns the maximum value of a Series or 1D array, .idxmax() returns the index label of the maximizing element. The argument axis=’columns’ (or axis=1) is required because, by default, this aggregation is done along the index (down each column) for a DataFrame; see the sketch after this list.
  • The final Series contains either ‘USA’ or ‘URS’ according to which country won the most medals in each Olympic edition. You can use .value_counts() to count the number of occurrences of each.
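A minimal sketch of those last two steps on a made-up two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({'USA': [30, 10, 25],
                    'URS': [20, 40, 50]},
                   index=[1952, 1956, 1960])

# For each row, which column holds the maximum; then how often each column "wins"
winner = toy.idxmax(axis='columns')
print(winner.value_counts())
# URS    2
# USA    1
```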

Instructions

  • Construct medals_won_by_country using medals.pivot_table().
    • The index should be the years (‘Edition’) & the columns should be country (‘NOC’)
    • The values should be ‘Athlete’ (which captures every medal regardless of kind) & the aggregation method should be ‘count’ (which captures the total number of medals won).
  • Create cold_war_usa_urs_medals by slicing the pivot table medals_won_by_country. Your slice should contain the editions from years 1952:1988 and only the columns ‘USA’ & ‘URS’ from the pivot table.
  • Create the Series most_medals by applying the .idxmax() method to cold_war_usa_urs_medals. Be sure to use axis=’columns’.
  • Print the result of applying .value_counts() to most_medals. The result reported gives the number of times each of the USA or the USSR won more Olympic medals in total than the other between 1952 and 1988.
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition',
                                           columns='NOC',
                                           values='Athlete',
                                           aggfunc='count')
medals_won_by_country.head()
(Output preview: a pivot table of 5 rows × 138 columns, with ‘Edition’ on the index and one column per ‘NOC’. Early editions are mostly NaN; for example, the ‘USA’ column reads 20.0, 55.0, 394.0, 63.0, 101.0 for the 1896–1912 editions.)

# Slice medals_won_by_country: cold_war_usa_urs_medals
cold_war_usa_urs_medals = medals_won_by_country.loc[1952:1988 , ['USA','URS']]

# Create most_medals 
most_medals = cold_war_usa_urs_medals.idxmax(axis='columns')

# Print most_medals.value_counts()
most_medals.value_counts()
URS    8
USA    2
Name: count, dtype: int64

Reshaping DataFrames for visualization

Data

medals = pd.read_csv(medals_data)

Reminder: plotting DataFrames

all_medals = medals.groupby('Edition')['Athlete'].count()
all_medals.head(6) # Series for all medals, all years
Edition
1896     151
1900     512
1904     470
1908     804
1912     885
1920    1298
Name: Athlete, dtype: int64

Plotting DataFrames

all_medals.plot(kind='line', marker='.')
plt.show()

[Line plot: total Olympic medals awarded per edition]

Grouping the data

france = medals.NOC == 'FRA'  # Boolean Series for France
france_grps = medals[france].groupby(['Edition', 'Medal'])
france_grps['Athlete'].count().head(10)
Edition  Medal 
1896     Bronze     2
         Gold       5
         Silver     4
1900     Bronze    53
         Gold      46
         Silver    86
1908     Bronze    21
         Gold       9
         Silver     5
1912     Bronze     5
Name: Athlete, dtype: int64

Reshaping the data

france_medals = france_grps['Athlete'].count().unstack()
france_medals.head(12)  # Single level index
| Edition | Bronze | Gold | Silver |
|---------|--------|------|--------|
| 1896 | 2.0 | 5.0 | 4.0 |
| 1900 | 53.0 | 46.0 | 86.0 |
| 1908 | 21.0 | 9.0 | 5.0 |
| 1912 | 5.0 | 10.0 | 10.0 |
| 1920 | 55.0 | 13.0 | 73.0 |
| 1924 | 20.0 | 39.0 | 63.0 |
| 1928 | 13.0 | 7.0 | 16.0 |
| 1932 | 6.0 | 23.0 | 8.0 |
| 1936 | 18.0 | 12.0 | 13.0 |
| 1948 | 21.0 | 25.0 | 22.0 |
| 1952 | 16.0 | 14.0 | 9.0 |
| 1956 | 13.0 | 6.0 | 13.0 |

Plotting the result

france_medals.plot(kind='line', marker='.')
plt.show()

[Line plot: France medal counts (Bronze, Gold, Silver) by edition]

Case Study Explorations

Visualizing USA Medal Counts by Edition: Line Plot

Your job in this exercise is to visualize the medal counts by ‘Edition’ for the USA. The DataFrame has been pre-loaded for you as medals.

Instructions

  • Create a DataFrame usa with data only for the USA.
  • Group usa such that [‘Edition’, ‘Medal’] is the index. Aggregate the count over ‘Athlete’.
  • Use .unstack() with level=’Medal’ to reshape the DataFrame usa_medals_by_year.
  • Construct a line plot from the final DataFrame usa_medals_by_year. This has been done for you, so hit ‘Submit Answer’ to see the plot!
medals.head()
|   | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
| 3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
| 4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
usa.head()
|    | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|----|---|---|---|---|---|---|---|---|---|---|
| 11 | Athens | 1896 | Athletics | Athletics | LANE, Francis | USA | Men | 100m | M | Bronze |
| 13 | Athens | 1896 | Athletics | Athletics | BURKE, Thomas | USA | Men | 100m | M | Gold |
| 15 | Athens | 1896 | Athletics | Athletics | CURTIS, Thomas | USA | Men | 110m hurdles | M | Gold |
| 19 | Athens | 1896 | Athletics | Athletics | BLAKE, Arthur | USA | Men | 1500m | M | Silver |
| 21 | Athens | 1896 | Athletics | Athletics | BURKE, Thomas | USA | Men | 400m | M | Gold |
# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
usa_medals_by_year.head()
Edition  Medal 
1896     Bronze     2
         Gold      11
         Silver     7
1900     Bronze    14
         Gold      27
Name: Athlete, dtype: int64
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()

[Line plot: USA medal counts by edition, one line per medal type]

Visualizing USA Medal Counts by Edition: Area Plot

As in the previous exercise, your job in this exercise is to visualize the medal counts by ‘Edition’ for the USA. This time, you will use an area plot to see the breakdown better. The usa DataFrame has been created and all reshaping from the previous exercise has been done. You need to write the plotting command.

Instructions

  • Create an area plot of usa_medals_by_year. This can be done by using .plot.area().
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()

[Area plot: USA medal counts by edition, stacked by medal type]

usa_medals_by_year.head()
| Edition | Bronze | Gold | Silver |
|---------|--------|------|--------|
| 1896 | 2 | 11 | 7 |
| 1900 | 14 | 27 | 14 |
| 1904 | 111 | 146 | 137 |
| 1908 | 15 | 34 | 14 |
| 1912 | 31 | 45 | 25 |
medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   City          29216 non-null  object
 1   Edition       29216 non-null  int64 
 2   Sport         29216 non-null  object
 3   Discipline    29216 non-null  object
 4   Athlete       29216 non-null  object
 5   NOC           29216 non-null  object
 6   Gender        29216 non-null  object
 7   Event         29216 non-null  object
 8   Event_gender  29216 non-null  object
 9   Medal         29216 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB

Visualizing USA Medal Counts by Edition: Area Plot with Ordered Medals

You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver. However, you would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold.

You can achieve this using Categorical types. In this final exercise, after redefining the ‘Medal’ column of the DataFrame medals, you will repeat the area plot from the previous exercise to see the new ordering.
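A minimal sketch on made-up values, showing that an ordered categorical sorts by the declared category order rather than alphabetically:

```python
import pandas as pd

s = pd.Series(['Gold', 'Bronze', 'Silver', 'Gold'])
cat = pd.Categorical(values=s, categories=['Bronze', 'Silver', 'Gold'], ordered=True)

# Sorting (and downstream groupby/unstack column order) now follows Bronze < Silver < Gold
print(pd.Series(cat).sort_values())
print(cat.max())  # 'Gold'
```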

Instructions

  • Redefine the ‘Medal’ column of the DataFrame medals as an ordered categorical. To do this, use pd.Categorical() with three keyword arguments:
    • values = medals.Medal.
    • categories=[‘Bronze’, ‘Silver’, ‘Gold’].
    • ordered=True.
    • After this, you can verify that the type has changed using medals.info().
  • Plot the final DataFrame usa_medals_by_year as an area plot. This has been done for you, so hit ‘Submit Answer’ to see how the plot has changed!
# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal,
                              categories=['Bronze', 'Silver', 'Gold'],
                              ordered=True)

# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']

# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'], observed=False)['Athlete'].count()

# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')

# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()

[Area plot: USA medal counts by edition, stacked in Bronze, Silver, Gold order]

medals.head(3)
|   | City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
| 1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
| 2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   City          29216 non-null  object  
 1   Edition       29216 non-null  int64   
 2   Sport         29216 non-null  object  
 3   Discipline    29216 non-null  object  
 4   Athlete       29216 non-null  object  
 5   NOC           29216 non-null  object  
 6   Gender        29216 non-null  object  
 7   Event         29216 non-null  object  
 8   Event_gender  29216 non-null  object  
 9   Medal         29216 non-null  category
dtypes: category(1), int64(1), object(8)
memory usage: 2.0+ MB
usa_medals_by_year.head()
| Edition | Bronze | Silver | Gold |
|---------|--------|--------|------|
| 1896 | 2 | 7 | 11 |
| 1900 | 14 | 14 | 27 |
| 1904 | 111 | 137 | 146 |
| 1908 | 15 | 14 | 34 |
| 1912 | 31 | 25 | 45 |

Final Thoughts

You can now…

  • Transform, extract, and filter data from DataFrames
  • Work with pandas indexes and hierarchical indexes
  • Reshape and restructure your data
  • Split your data into groups and categories
import sys

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') 
        and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)
[('medals', 15496232),
 ('ev_gen', 3462642),
 ('usa', 2330626),
 ('cold_war_medals', 2130475),
 ('gapminder', 1901820),
 ('country_names', 1753124),
 ('gapminder_mask', 675797),
 ('titanic', 603665),
 ('life', 95093),
 ('under10', 84022),
 ('auto', 74338),
 ('unemployment', 51907),
 ('gapminder_2010', 36404),
 ('medals_won_by_country', 30016),
 ('during_cold_war', 29380),
 ('france', 29380),
 ('is_usa_urs', 29380),
 ('sus', 29380),
 ('regions', 28636),
 ('reg_disp', 22404),
 ('election', 20286),
 ('standardized', 16557),
 ('outliers', 13527),
 ('countries', 13325),
 ('right_columns', 12797),
 ('counted', 12728),
 ('middle_columns', 11725),
 ('results', 11725),
 ('left_columns', 11469),
 ('medal_counts', 9416),
 ('aggregated', 8783),
 ('df_celsius', 6004),
 ('group', 5974),
 ('sales_values', 5904),
 ('turnout_zscore', 4876),
 ('high_turnout', 4407),
 ('too_close', 4407),
 ('sales', 2812),
 ('skinny', 1728),
 ('NamespaceMagics', 1688),
 ('high_turnout_df', 1673),
 ('more_trials', 1220),
 ('by_com_filt', 1205),
 ('stocks', 1202),
 ('weather', 1148),
 ('p_counties', 966),
 ('p_counties_rev', 966),
 ('france_medals', 832),
 ('usa_medals_by_year', 832),
 ('medal_count_by_gender', 774),
 ('df2', 756),
 ('kv_pairs', 744),
 ('gm_outliers', 737),
 ('users', 720),
 ('most_medals', 712),
 ('trials', 692),
 ('visitors', 688),
 ('newusers', 675),
 ('users_idx', 675),
 ('ev_gen_uniques', 666),
 ('sorted_trials', 665),
 ('stacked', 665),
 ('swapped', 665),
 ('df_new', 660),
 ('chevy_means', 632),
 ('all_month2', 627),
 ('suspect', 623),
 ('CA_TX_month2', 601),
 ('by_com_sum', 574),
 ('enough_salt_sold', 570),
 ('shares', 565),
 ('chevy', 556),
 ('units_sum', 544),
 ('count_mult', 540),
 ('df', 536),
 ('three_counties', 463),
 ('survived_mean_2', 460),
 ('all_medals', 448),
 ('sports', 448),
 ('customers', 409),
 ('NY_month1', 347),
 ('signups_and_visitors', 322),
 ('visitors_by_city_weekday', 316),
 ('new_trials', 312),
 ('election_penn', 303),
 ('survived_mean_1', 285),
 ('Nsports', 276),
 ('cold_war_usa_urs_medals', 272),
 ('signups_and_visitors_total', 260),
 ('weather_data', 255),
 ('medals_data', 228),
 ('byweekday', 222),
 ('by_city_day', 216),
 ('bycity', 216),
 ('pivot', 216),
 ('gapminder_data', 213),
 ('trials_by_gender', 212),
 ('auto_mpg', 210),
 ('titanic_data', 210),
 ('sales2_data', 204),
 ('sales_data', 201),
 ('users_data', 201),
 ('count_by_weekday1', 200),
 ('count_by_weekday2', 200),
 ('aggregator', 184),
 ('red_vs_blue', 184),
 ('signups_pivot', 184),
 ('visitors_pivot', 184),
 ('c_surv_by_sex', 172),
 ('c_deck_survival', 152),
 ('data_range', 152),
 ('disparity', 152),
 ('dozens', 152),
 ('impute_median', 152),
 ('open', 152),
 ('spread', 152),
 ('to_celsius', 152),
 ('zscore', 152),
 ('zscore_def', 152),
 ('zscore_with_year_and_name', 152),
 ('new_idx', 120),
 ('sales_feb', 108),
 ('days', 104),
 ('months', 104),
 ('prices', 104),
 ('massachusetts_labor', 103),
 ('cols', 88),
 ('rows', 88),
 ('sales_cols', 88),
 ('count_by_class', 80),
 ('np', 72),
 ('pd', 72),
 ('plt', 72),
 ('USA_edition_grouped', 56),
 ('by_class', 56),
 ('by_class_sub', 56),
 ('by_company', 56),
 ('by_day', 56),
 ('by_mult', 56),
 ('by_sex', 56),
 ('by_sex_class', 56),
 ('by_year_region', 56),
 ('country_grouped', 56),
 ('france_grps', 56),
 ('life_by_region', 56),
 ('medals_by_gender', 56),
 ('regional', 56),
 ('splitting', 56),
 ('avg', 32),
 ('group_name', 28),
 ('x', 28),
 ('y', 28),
 ('NaN', 24)]

Certificate

This post is licensed under CC BY 4.0 by the author.