import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy import nan  # np.NaN was removed in NumPy 2.0; nan is the canonical alias
from scipy.stats import zscore
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.expand_frame_repr', True)
election_penn = 'data/manipulating-dataframes-with-pandas/2012_US_election_results_(Pennsylvania).csv'
gapminder_data = 'data/manipulating-dataframes-with-pandas/gapminder.csv'
medals_data = 'data/manipulating-dataframes-with-pandas/olympic_medals.csv'
weather_data = 'data/manipulating-dataframes-with-pandas/Pittsburgh_weather_data.csv'
sales_data = 'data/manipulating-dataframes-with-pandas/sales.csv'
sales2_data = 'data/manipulating-dataframes-with-pandas/sales2.csv'
sales_feb = 'data/manipulating-dataframes-with-pandas/sales-feb-2015.csv'
titanic_data = 'data/manipulating-dataframes-with-pandas/titanics.csv'
users_data = 'data/manipulating-dataframes-with-pandas/users.csv'
massachusetts_labor = 'data/manipulating-dataframes-with-pandas/LURReport.csv'
auto_mpg = 'data/manipulating-dataframes-with-pandas/auto-mpg.csv'
Course Description
In this course, you'll learn how to leverage pandas' extremely powerful data manipulation engine to get the most out of your data. It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. You will learn how to tidy, rearrange, and restructure your data by pivoting or melting and stacking or unstacking DataFrames. These are all fundamental next steps on the road to becoming a well-rounded Data Scientist, and you will have the chance to apply all the concepts you learn to real-world datasets.
In this chapter, you will learn all about how to index, slice, filter, and transform DataFrames, using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.
df = pd.read_csv(sales_data, index_col='month')
df
df['salt']['Jan']
df.eggs['Mar']
df.loc['May', 'spam']
df.iloc[4, 2]
df_new = df[['salt','eggs']]
df_new
In this exercise, the DataFrame election is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select 'Bedford' county and the 'winner' column. Which method is the preferred way?
election = pd.read_csv(election_penn, index_col='county')
election.head()
election.loc['Bedford', 'winner']
Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. The DataFrame is provided for you as election.
Find x and y such that election.iloc[x, y] == election.loc['Bedford', 'winner']. That is, what is the row position of 'Bedford', and the column position of 'winner'? Remember that the first position in Python is 0, not 1!
To answer this question, first explore the DataFrame using election.head() in the IPython Shell and inspect it with your eyes.
Instructions
# Assign the row position of election.loc['Bedford']: x
x = 4
# Assign the column position of election['winner']: y
y = 4
# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])
Depending on the situation, you may wish to use .iloc[] over .loc[], and vice versa. The important thing to realize is that you can achieve exactly the same results using either approach.
There are circumstances in which it's useful to modify the order of your DataFrame columns. We do that now by extracting just two columns from the Pennsylvania election results DataFrame.
Your job is to read the CSV file and set the index to 'county'. You'll then assign a new DataFrame by selecting the list of columns ['winner', 'total', 'voters']. The CSV file is provided to you in the variable filename.
Instructions
# Read in filename and set the index: election
election = pd.read_csv(election_penn, index_col='county')
# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]
# Print the output of results.head()
print(results.head())
The original election DataFrame had 6 columns, but as you can see, your results DataFrame now has just the 3 columns: 'winner', 'total', and 'voters'.
df = pd.read_csv(sales_data, index_col='month')
df
df['eggs']
type(df.eggs)
df['eggs'][1:4] # Part of the eggs column
df['eggs'][4] # The value associated with May
df.loc[:, 'eggs':'salt'] # All rows, some columns
df.loc['Jan':'Apr',:] # Some rows, all columns
df.loc['Mar':'May', 'salt':'spam']
df.iloc[2:5, 1:] # A block from middle of the DataFrame
df.loc['Jan':'May', ['eggs', 'spam']]
df.iloc[[0,4,5], 0:2]
# A Series by column name
df['eggs']
type(df['eggs'])
# A DataFrame w/ single column
df[['eggs']]
type(df[['eggs']])
The Pennsylvania US election results data set that you have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this exercise, you're going to perform slicing on the county names of the election DataFrame from the previous exercises, which has been pre-loaded for you.
Instructions
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
# Print the p_counties DataFrame
print(p_counties)
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]
# Print the p_counties_rev DataFrame
print(p_counties_rev)
Similar to row slicing, columns can be sliced by value. In this exercise, your job is to slice column names from the Pennsylvania election results DataFrame using .loc[].
It has been pre-loaded for you as election, with the index set to 'county'.
Instructions
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']
# Print the output of left_columns.head()
print('Left Columns: \n', left_columns.head())
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']
# Print the output of middle_columns.head()
print('\nMiddle Columns: \n', middle_columns.head())
# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, 'Romney':]
# Print the output of right_columns.head()
print('\nRight Columns: \n', right_columns.head())
You can use lists to select specific row and column labels with the .loc[] accessor. In this exercise, your job is to select the counties ['Philadelphia', 'Centre', 'Fulton'] and the columns ['winner','Obama','Romney'] from the election DataFrame, which has been pre-loaded for you with the index set to 'county'.
Instructions
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']
# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']
# Create the new DataFrame: three_counties
three_counties = election.loc[rows, cols]
# Print the three_counties DataFrame
print(three_counties)
If you know exactly which rows and columns are of interest to you, this is a useful approach for subselecting DataFrames.
df = pd.read_csv(sales_data, index_col='month')
df
df.salt > 60
df[df.salt > 60]
enough_salt_sold = df.salt > 60
df[enough_salt_sold]
df[(df.salt >= 50) & (df.eggs < 200)] # Both conditions
df[(df.salt >= 50) | (df.eggs < 200)] # Either condition
df2 = df.copy()
df2['bacon'] = [0, 0, 50, 60, 70, 80]
df2
df2.loc[:, df2.all()]
df2.loc[:, df2.any()]
df.loc[:, df.isnull().any()]
df.loc[:, df.notnull().all()]
df.dropna(how='any')
df.eggs[df.salt > 55]
df.loc[df.salt > 55, 'eggs'] += 5  # use .loc to avoid chained-assignment issues
df
In this exercise, we have provided the Pennsylvania election results and included a column called 'turnout' that contains the percentage of voter turnout per county. Your job is to prepare a boolean array to select all of the rows and columns where voter turnout exceeded 70%.
As before, the DataFrame is available to you as election with the index set to 'county'.
Instructions
election = pd.read_csv(election_penn, index_col='county')
# Create the boolean array: high_turnout
high_turnout = election.turnout > 70
# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]
# Print the high_turnout_df DataFrame
high_turnout_df
The election results DataFrame has a column labeled 'margin' which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than 1%, the results would be too close to call.
Your job is to use boolean selection to filter the rows where the margin was less than 1. You'll then convert these rows of the 'winner' column to np.nan to indicate that these results are too close to declare a winner.
The DataFrame has been pre-loaded for you as election.
Instructions
# Create the boolean array: too_close
too_close = election.margin < 1
# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = np.nan
# Print the output of election.info()
election.info()
In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The .dropna() method is used to perform this action. You'll now practice using this method on a dataset obtained from Vanderbilt University, which consists of data from passengers on the Titanic.
The DataFrame has been pre-loaded for you as titanic. Explore it in the IPython Shell and you will note that there are many NaNs. You will focus specifically on the 'age' and 'cabin' columns in this exercise. Your job is to use .dropna() to remove rows where either of these two columns contains missing data, and then rows where both of these columns contain missing data.
You'll also use the .shape attribute, which returns the number of rows and columns in a tuple from a DataFrame, or the number of rows from a Series, to see the effect of dropping missing values from a DataFrame.
Finally, you'll use the thresh= keyword argument to drop columns from the full dataset that have less than 1000 non-missing values.
Instructions
titanic = pd.read_csv(titanic_data)
# Select the 'age' and 'cabin' columns: df
df = titanic[['age', 'cabin']]
# Print the shape of df
print(df.shape)
# Drop rows in df with how='any' and print the shape
print('\n', df.dropna(how='any').shape)
# Drop rows in df with how='all' and print the shape
print('\n', df.dropna(how='all').shape)
# Drop columns in titanic with fewer than 1000 non-missing values
titanic.dropna(thresh=1000, axis='columns').info()
df = pd.read_csv(sales_data, index_col='month')
df
df.floordiv(12) # Convert to dozens unit
np.floor_divide(df, 12) # Convert to dozens unit
def dozens(n):
    return n // 12
df.apply(dozens) # Convert to dozens unit
df.apply(lambda n: n//12)
df['dozens_of_eggs'] = df.eggs.floordiv(12)
df
df.index
df.index = df.index.str.upper()
df
df.index = df.index.map(str.lower)
df
df['salty_eggs'] = df.salt + df.dozens_of_eggs
df
The .apply() method can be used on a pandas DataFrame to apply an arbitrary Python function to every element. In this exercise you'll take daily weather data in Pittsburgh in 2013 obtained from Weather Underground.
A function to convert degrees Fahrenheit to degrees Celsius has been written for you. Your job is to use the .apply() method to perform this conversion on the 'Mean TemperatureF' and 'Mean Dew PointF' columns of the weather DataFrame.
Instructions
weather = pd.read_csv(weather_data)
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9 * (F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF', 'Mean Dew PointF']].apply(to_celsius)
df_celsius.head()
# Reassign the columns df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
df_celsius.head()
The .map() method is used to transform values according to a Python dictionary look-up. In this exercise you'll practice this method while returning to working with the election DataFrame, which has been pre-loaded for you.
Your job is to use a dictionary to map the values 'Obama' and 'Romney' in the 'winner' column to the values 'blue' and 'red', and assign the output to the new column 'color'.
Instructions
election = pd.read_csv(election_penn, index_col='county')
election.head()
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue', 'Romney':'red'}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)
# Print the output of election.head()
election.head()
When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).
You can even write your own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas.
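As a rough illustration of the difference (a hypothetical aside using the election DataFrame from above, not part of the exercise), compare an element-by-element .apply() with the equivalent vectorized NumPy call:
import math
# .apply() with a Python function calls it once per element
slow_sqrt = election['total'].apply(lambda x: math.sqrt(x))
# np.sqrt is a UFunc: it operates on the whole column at once in compiled code
fast_sqrt = np.sqrt(election['total'])
print((slow_sqrt == fast_sqrt).all())   # same values either way; only the speed differs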
In this exercise you're going to import the zscore function from scipy.stats and use it to compute the deviation in voter turnout in Pennsylvania from the mean in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean.
Instead of using .apply() as you did in the earlier exercises, the zscore UFunc will take a pandas Series as input and return a NumPy array. You will then assign the values of the NumPy array to a new column in the DataFrame. You will be working with the election DataFrame - it has been pre-loaded for you.
Instructions
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)
# Print the type of turnout_zscore
print('Type: \n', type(turnout_zscore), '\n')
# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore
# Print the output of election.head()
election.head()
Having learned the fundamentals of working with DataFrames, you will now move on to more advanced indexing techniques. You will learn about MultiIndexes, or hierarchical indexes, and learn how to interact with and extract data from them.
prices = [10.70, 10.86, 10.74, 10.71, 10.79]
shares = pd.Series(prices)
shares
days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
shares = pd.Series(prices, index=days)
shares
print(shares.index)
print(shares.index[2])
print(shares.index[:2])
print(shares.index[-2:])
print(shares.index.name)
shares.index.name = 'weekday'
shares
try:
    shares.index[2] = 'Wednesday'
except TypeError:
    print('TypeError: Index does not support mutable operations')
try:
    shares.index[:4] = ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
except TypeError:
    print('TypeError: Index does not support mutable operations')
shares.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
shares
unemployment = pd.read_csv(massachusetts_labor, parse_dates=[['Year', 'Month']])
unemployment.head()
unemployment.info()
unemployment.index = unemployment.Area
unemployment.head()
del unemployment['Area']
unemployment.head()
print('Index: \n', unemployment.index)
print('\nIndex Name:\n', unemployment.index.name)
print('\nIndex Type:\n', type(unemployment.index))
print('\nDataFrame Columns\n', unemployment.columns)
unemployment = pd.read_csv(massachusetts_labor,
parse_dates=[['Year', 'Month']],
index_col='Area')
unemployment.head()
df = pd.read_csv(sales_data, index_col='month')
df
Which one of the following index operations does not raise an error?
The sales DataFrame which you have seen in the videos of the previous chapter has been pre-loaded for you and is available for exploration in the IPython Shell.
Instructions
Possible Answers
As you saw in the previous exercise, indexes are immutable objects. This means that if you want to change or modify the index in a DataFrame, then you need to change the whole index. You will do this now, using a list comprehension to create the new index.
A list comprehension is a succinct way to generate a list in one line. For example, the following list comprehension generates a list that contains the cubes of all numbers from 0 to 9:
cubes = [i**3 for i in range(10)]
This is equivalent to the following code:
cubes = []
for i in range(10):
    cubes.append(i**3)
Before getting started, print the sales DataFrame in the IPython Shell and verify that the index is given by month abbreviations containing lowercase characters.
Instructions
# Create the list of new indexes: new_idx
new_idx = [x.upper() for x in df.index]
# Assign new_idx to sales.index
df.index = new_idx
# Print the sales DataFrame
print(df)
Notice that in the previous exercise, the index was not labeled with a name. In this exercise, you will set its name to 'MONTHS'.
Similarly, if all the columns are related in some way, you can provide a label for the set of columns.
To get started, print the sales DataFrame in the IPython Shell and verify that the index has no name, only its data (the month names).
Instructions
# Assign the string 'MONTHS' to sales.index.name
df.index.name = 'MONTHS'
# Print the sales DataFrame
print(df)
# Assign the string 'PRODUCTS' to sales.columns.name
df.columns.name = 'PRODUCTS'
# Print the sales dataframe again
print('\n', df)
You can also build the DataFrame and index independently, and then put them together. If you take this route, be careful, as any mistakes in generating the DataFrame or the index can cause the data and the index to be aligned incorrectly.
In this exercise, the sales DataFrame has been provided for you without the month index. Your job is to build this index separately and then assign it to the sales DataFrame. Before getting started, print the sales DataFrame in the IPython Shell and note that it's missing the month information.
Instructions
df = pd.read_csv(sales_data, usecols=['eggs', 'salt', 'spam'])
df
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
# Assign months to sales.index
df.index = months
# Print the modified sales DataFrame
print(df)
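To make the earlier caveat concrete, here is a hypothetical illustration: pandas validates only the length of a new index, not its order, so assigning a scrambled list of labels silently mislabels every row.
wrong_order = ['Jun', 'May', 'Apr', 'Mar', 'Feb', 'Jan']
mislabeled = df.copy()
mislabeled.index = wrong_order   # no error is raised, but every row now carries the wrong month
print(mislabeled)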
stocks = pd.DataFrame([['2016-10-03', 31.50, 14070500, 'CSCO'],
['2016-10-03', 112.52, 21701800, 'AAPL'],
['2016-10-03', 57.42, 19189500, 'MSFT'],
['2016-10-04', 113.00, 29736800, 'AAPL'],
['2016-10-04', 57.24, 20085900, 'MSFT'],
['2016-10-04', 31.35, 18460400, 'CSCO'],
['2016-10-05', 57.64, 16726400, 'MSFT'],
['2016-10-05', 31.59, 11808600, 'CSCO'],
['2016-10-05', 113.05, 21453100, 'AAPL']],
columns=['Date', 'Close', 'Volume', 'Symbol'])
stocks
stocks = stocks.set_index(['Symbol', 'Date'])
stocks
stocks.index
print(stocks.index.name)
print(stocks.index.names)
stocks = stocks.sort_index()
stocks
stocks.loc[('CSCO', '2016-10-04')]
stocks.loc[('CSCO', '2016-10-04'), 'Volume']
stocks.loc['AAPL']
stocks.loc['CSCO':'MSFT']
stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), :]
stocks.loc[(['AAPL', 'MSFT'], '2016-10-05'), 'Close']
stocks.loc[('CSCO', ['2016-10-05', '2016-10-03']), :]
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')),:]
sales = pd.read_csv(sales2_data, index_col=['state', 'month'])
sales
In the video, Dhavide explained the concept of a hierarchical index, or a MultiIndex. You will now practice working with these types of indexes.
The sales DataFrame you have been working with has been extended to now include State information as well. In the IPython Shell, print the new sales DataFrame to inspect the data. Take note of the MultiIndex!
Extracting elements from the outermost level of a MultiIndex is just like in the case of a single-level Index. You can use the .loc[] accessor as Dhavide demonstrated in the video.
Instructions
# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])
# Print sales['CA':'TX']
print('\n', sales['CA':'TX'])
Notice how New York is excluded by the first operation, and included in the second one.
In the previous exercise, the MultiIndex was created and sorted for you. Now, you're going to do this yourself! With a MultiIndex, you should always ensure the index is sorted. You can skip this only if you know the data is already sorted on the index fields.
To get started, print the pre-loaded sales DataFrame in the IPython Shell to verify that there is no MultiIndex.
Instructions
sales = pd.read_csv(sales2_data)
# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales = sales.sort_index()
# Print the sales DataFrame
sales
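If you want to double-check programmatically that the MultiIndex is now sorted (a small aside, not part of the exercise), the index exposes an is_monotonic_increasing attribute:
# Should print True after the sort_index() call above
print(sales.index.is_monotonic_increasing)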
As Dhavide mentioned in the video, it is always preferable to have a meaningful index that uniquely identifies each row. Even though pandas does not require unique index values in DataFrames, it works better if the index values are indeed unique. To see an example of this, you will index your sales data by 'state' in this exercise.
As always, begin by printing the sales DataFrame in the IPython Shell and inspecting it.
Instructions
sales = pd.read_csv(sales2_data)
# Set the index to the column 'state': sales
sales = sales.set_index('state')
# Print the sales DataFrame
print(sales)
# Access the data from 'NY'
print('\n', sales.loc['NY'])
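As an aside (not part of the exercise), you can check explicitly whether the index values are unique; with 'state' alone as the index they are not, which is why the 'NY' lookup above returns more than one row:
print(sales.index.is_unique)   # False: each state appears once per month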
Looking up indexed data is fast and efficient. And you have already seen that lookups based on the outermost level of a MultiIndex work just like lookups on DataFrames that have a single-level Index.
Looking up data based on inner levels of a MultiIndex can be a bit trickier. In this exercise, you will use your sales DataFrame to do some increasingly complex lookups.
The trickiest of all these lookups are when you want to access some inner levels of the index. In this case, you need to use slice(None) in the slicing parameter for the outermost dimension(s) instead of the usual :, or use pd.IndexSlice. You can refer to the pandas documentation for more details. For example, in the video, Dhavide used the following code to extract rows from all Symbols for the dates Oct. 3rd through 4th inclusive:
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
Pay particular attention to the tuple:
(slice(None), slice('2016-10-03', '2016-10-04')).
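For reference, pd.IndexSlice gives an equivalent, arguably more readable spelling of the same lookup; a minimal sketch using the stocks DataFrame built earlier in this notebook:
idx = pd.IndexSlice
stocks.loc[idx[:, '2016-10-03':'2016-10-04'], :]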
Instructions
sales = pd.read_csv(sales2_data, index_col=['state', 'month'])
sales
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1), :]
NY_month1
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]
CA_TX_month2
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
all_month2
Here, you will learn how to reshape your DataFrames using techniques such as pivoting, melting, stacking, and unstacking. These are powerful techniques that allow you to tidy and rearrange your data into the format that allows you to most easily analyze it for insights.
trials = pd.DataFrame([[1, 'A', 'F', 5],
[2, 'A', 'M', 3],
[3, 'B', 'F', 8],
[4, 'B', 'M', 9]],
columns=['id', 'treatment', 'gender', 'response'])
trials
trials.pivot(index='treatment',
columns='gender',
values='response')
trials.pivot(index='treatment', columns='gender')
Prior to using .pivot(), you need to set the index of the DataFrame somehow. Is this statement True or False?
Answer the question
False
Suppose you started a blog for a band, and you would like to log how many visitors you have had, and how many signed-up for your newsletter. To help design the tours later, you track where the visitors are. A DataFrame called users consisting of this information has been pre-loaded for you.
Inspect users in the IPython Shell and make a note of which variable you want to use to index the rows ('weekday'), which variable you want to use to index the columns ('city'), and which variable will populate the values in the cells ('visitors'). Try to visualize what the result should be.
For example, in the video, Dhavide used 'treatment' to index the rows, 'gender' to index the columns, and 'response' to populate the cells. Prior to pivoting, the DataFrame looked like this:
id treatment gender response
0 1 A F 5
1 2 A M 3
2 3 B F 8
3 4 B M 9
After pivoting:
gender F M
treatment
A 5 3
B 8 9
In this exercise, your job is to pivot users so that the focus is on 'visitors', with the columns indexed by 'city' and the rows indexed by 'weekday'.
Instructions
users = pd.read_csv(users_data)
users
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday',
columns='city',
values='visitors')
visitors_pivot
Notice how in the pivoted DataFrame, the index is labeled 'weekday', the columns are labeled 'city', and the values are populated by the number of visitors.
If you do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both 'visitors' and 'signups' will be pivoted, creating hierarchical column labels.
You will explore this for yourself now in this exercise.
Instructions
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index='weekday', columns='city', values='signups')
signups_pivot
# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday', columns='city')
pivot
Notice how in the second DataFrame, both 'signups' and 'visitors' were pivoted by default since you didn't provide an argument for the values parameter.
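As an aside, once the columns are hierarchical you can drill into them by level; a small sketch (assuming 'Austin' is one of the cities in users):
pivot['visitors']                # all cities for the 'visitors' variable
pivot[('visitors', 'Austin')]    # a single (variable, city) column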
trials
trials = trials.set_index(['treatment', 'gender'])
trials
trials.unstack(level='gender')
trials
trials.unstack(level=1)
trials_by_gender = trials.unstack(level='gender')
trials_by_gender
trials_by_gender.stack(level='gender')
stacked = trials_by_gender.stack(level='gender')
stacked
swapped = stacked.swaplevel(0, 1)
swapped
sorted_trials = swapped.sort_index()
sorted_trials
You are now going to practice stacking and unstacking DataFrames. The users DataFrame you have been working with in this chapter has been pre-loaded for you, this time with a MultiIndex. Explore it in the IPython Shell to see the data layout. Pay attention to the index, and notice that the index levels are ['city', 'weekday']. So 'weekday' - the second entry - has position 1. This position is what corresponds to the level parameter in .stack() and .unstack() calls. Alternatively, you can specify 'weekday' as the level instead of its position.
Your job in this exercise is to unstack users by 'weekday'. You will then use .stack() on the unstacked DataFrame to see if you get back the original layout of users.
Instructions
users = pd.read_csv(users_data)
users.set_index(['city', 'weekday'], inplace=True)
users = users.sort_index()
users
# Unstack users by 'weekday': byweekday
byweekday = users.unstack(level='weekday')
byweekday
# Stack byweekday by 'weekday' and print it
byweekday.stack(level='weekday')
You are now going to continue working with the users DataFrame. As always, first explore it in the IPython Shell to see the layout and note the index.
Your job in this exercise is to unstack and then stack the 'city' level, as you did previously for 'weekday'. Note that you won't get the same DataFrame.
Instructions
users
# Unstack users by 'city': bycity
bycity = users.unstack(level='city')
bycity
# Stack bycity by 'city' and print it
bycity.stack(level='city')
Continuing from the previous exercise, you will now use .swaplevel(0, 1) to flip the index levels. Note they won't be sorted. To sort them, you will have to follow up with a .sort_index(). You will then obtain the original DataFrame. Note that an unsorted index leads to slicing failures.
To begin, print both users and bycity in the IPython Shell. The goal here is to convert bycity back to something that looks like users.
Instructions
bycity
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level='city')
newusers
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)
newusers
# Sort the index of newusers: newusers
newusers = newusers.sort_index()
newusers
# Verify that the new DataFrame is equal to the original
newusers.equals(users)
trials = pd.DataFrame([[1, 'A', 'F', 5],
[2, 'A', 'M', 3],
[3, 'B', 'F', 8],
[4, 'B', 'M', 9]],
columns=['id', 'treatment', 'gender', 'response'])
trials
trials.pivot(index='treatment',
columns='gender',
values='response')
new_trials = pd.DataFrame([['A', 5, 3],
['B', 8, 9],],
columns=['treatment', 'F', 'M'])
new_trials
pd.melt(new_trials)
pd.melt(new_trials, id_vars=['treatment'])
pd.melt(new_trials, id_vars=['treatment'],
value_vars=['F', 'M'])
pd.melt(new_trials, id_vars=['treatment'],
var_name='gender', value_name='response')
You are now going to practice melting DataFrames. A DataFrame called visitors_by_city_weekday has been pre-loaded for you. Explore it in the IPython Shell and see that it is the users DataFrame from previous exercises with the rows indexed by 'weekday', columns indexed by 'city', and values populated with 'visitors'.
Recall from the video that the goal of melting is to restore a pivoted DataFrame to its original form, or to change it from a wide shape to a long shape. You can explicitly specify the columns that should remain in the reshaped DataFrame with id_vars, and list which columns to convert into values with value_vars. As Dhavide demonstrated, if you don't pass a name for the values in pd.melt(), you will lose the name of your variable. You can fix this by using the value_name keyword argument.
Your job in this exercise is to melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called 'city'. If you were to use just pd.melt(visitors_by_city_weekday), you would obtain the following result:
city value
0 weekday Mon
1 weekday Sun
2 Austin 326
3 Austin 139
4 Dallas 456
5 Dallas 237
Therefore, you have to specify the id_vars keyword argument to ensure that 'weekday' is retained in the reshaped DataFrame, and the value_name keyword argument to change the name of value to visitors.
Instructions
# Create the DataFrame for the exercise
visitors = pd.DataFrame({'weekday': ['Mon', 'Sun', 'Mon', 'Sun'],
'city': ['Austin', 'Austin', 'Dallas', 'Dallas'],
'visitors': [326, 139, 456, 237]})
visitors
# Reshape the DataFrame for the exercise
visitors_by_city_weekday = visitors.pivot(index='weekday',
columns='city',
values='visitors')
visitors_by_city_weekday
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
visitors_by_city_weekday
# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')
visitors
Notice how the melted DataFrame now has a 'city' column with Austin and Dallas as its values. In the original DataFrame, they were columns themselves. Also note how specifying the value_name parameter has renamed the 'value' column to 'visitors'.
You can move multiple columns into a single column (making the data long and skinny) by "melting" multiple columns. In this exercise, you will practice doing this.
The users DataFrame has been pre-loaded for you. As always, explore it in the IPython Shell and note the index.
Instructions
users = pd.read_csv(users_data)
users
# Melt users: skinny
skinny = users.melt(id_vars=['weekday', 'city'])
skinny
Because the var_name and value_name parameters weren't specified, the melted DataFrame has the default 'variable' and 'value' column names.
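If you want more descriptive names than the defaults, you can pass var_name and value_name yourself; a hypothetical variation on the exercise (the names 'metric' and 'count' are just illustrative):
skinny_named = users.melt(id_vars=['weekday', 'city'], var_name='metric', value_name='count')
skinny_named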
Sometimes, all you need is some key-value pairs, and the context does not matter. If said context is in the index, you can easily obtain what you want. For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs. So if you created a hierarchical index with 'city' and 'weekday' columns as the index, you can easily extract key-value pairs for the 'visitors' and 'signups' columns by melting users and specifying col_level=0.
Instructions
users
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])
users_idx
# Obtain the key-value pairs: kv_pairs
kv_pairs = users_idx.melt(col_level=0)
kv_pairs
more_trials = pd.DataFrame([[1, 'A', 'F', 5],
[2, 'A', 'M', 3],
[3, 'A', 'M', 8],
[4, 'A', 'F', 9],
[5, 'B', 'F', 1],
[6, 'B', 'M', 8],
[7, 'B', 'F', 4],
[8, 'B', 'F', 6]],
columns=['id', 'treatment', 'gender', 'response'])
more_trials
try:
    more_trials.pivot(index='treatment',
                      columns='gender',
                      values='response')
except ValueError:
    print('ValueError: Index contains duplicate entries, cannot reshape')
more_trials.pivot_table(index='treatment',
columns='gender',
values='response')
more_trials.pivot_table(index='treatment',
columns='gender',
values='response',
aggfunc='count')
Recall from the video that a pivot table allows you to see all of your variables as a function of two other variables. In this exercise, you will use the .pivot_table() method to see how the users DataFrame entries appear when presented as functions of the 'weekday' and 'city' columns. That is, with the rows indexed by 'weekday' and the columns indexed by 'city'.
Before using the pivot table, print the users DataFrame in the IPython Shell and observe the layout.
Instructions
users = pd.read_csv(users_data)
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index='weekday',
columns='city')
by_city_day
Notice the labels of the index and the columns are 'weekday' and 'city', respectively - exactly as you specified.
You can also use aggregation functions within a pivot table by specifying the aggfunc parameter. In this exercise, you will practice using the 'count' and len aggregation functions - which produce the same result - on the users DataFrame.
Instructions
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday',
aggfunc='count')
count_by_weekday1
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday',
aggfunc=len)
count_by_weekday2
# Verify that the same result is obtained
count_by_weekday1.equals(count_by_weekday2)
Sometimes it's useful to add totals in the margins of a pivot table. You can do this with the argument margins=True. In this exercise, you will practice using margins in a pivot table along with a new aggregation function: sum.
The users DataFrame, which you are now probably very familiar with, has been pre-loaded for you.
Instructions
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday',
aggfunc=sum)
signups_and_visitors
# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index='weekday',
aggfunc=sum,
margins=True)
signups_and_visitors_total
Specifying margins=True resulted in the totals in each column being computed.
In this chapter, you'll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You'll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.
sales = pd.DataFrame(
{
'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
'bread': [139, 237, 326, 456],
'butter': [20, 45, 70, 98]
}
)
sales
sales.loc[sales['weekday'] == 'Sun'].count()
sales.groupby('weekday').count()
sales.groupby('weekday')['bread'].sum()
sales
sales.groupby('weekday')[['bread','butter']].sum()
sales.groupby(['city','weekday']).mean()
customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers
sales.groupby(customers)['bread'].sum()
sales['weekday'].unique()
sales['weekday'] = sales['weekday'].astype('category')
sales['weekday']
What are the main advantages of storing data explicitly as categorical types instead of object types?
Answer the question
Computations are faster and categorical data requires less memory than the equivalent object (string) data.
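A rough way to see the memory benefit for yourself (not part of the course material; with only a handful of rows the difference is small, but it grows with column length):
# Compare the object and categorical representations of the same column
as_object = sales['weekday'].astype('object')
as_category = sales['weekday']   # already converted to 'category' above
print(as_object.memory_usage(deep=True), as_category.memory_usage(deep=True))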
In this exercise, you will return to working with the Titanic dataset from Chapter 1 and use .groupby() to analyze the distribution of passengers who boarded the Titanic.
The 'pclass' column identifies which class of ticket was purchased by the passenger and the 'embarked' column indicates at which of the three ports the passenger boarded the Titanic. 'S' stands for Southampton, England, 'C' for Cherbourg, France and 'Q' for Queenstown, Ireland.
Your job is to first group by the 'pclass' column and count the number of rows in each class using the 'survived' column. You will then group by the 'embarked' and 'pclass' columns and count the number of passengers.
The DataFrame has been pre-loaded as titanic.
Instructions
Titanic Data
titanic = pd.read_csv(titanic_data)
titanic.head(3)
# Group titanic by 'pclass'
by_class = titanic.groupby('pclass')
# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()
count_by_class
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked', 'pclass'])
# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()
count_mult
Grouping data by certain columns and aggregating another column, in this case 'survived', allows for careful examination of the data for interesting insights.
In this exercise, you'll use two data sets from Gapminder.org to investigate the average life expectancy (in years) at birth in 2010 for the 6 continental regions. To do this you'll read the life expectancy data per country into one pandas DataFrame and the association between country and region into another.
By setting the index of both DataFrames to the country name, you'll then use the region information to group the countries in the life expectancy DataFrame and compute the mean value for 2010.
In the original exercise, the life expectancy CSV file is available in the variable life_fname and the regions filename in the variable regions_fname; here, both are derived from the single Gapminder CSV referenced by gapminder_data.
Instructions
# Read life_fname into a DataFrame: life
life = pd.read_csv(gapminder_data, usecols=['Year', 'life', 'Country'])
life = life.pivot(index='Country',
columns='Year',
values='life')
life.head()
regions = pd.read_csv(gapminder_data, usecols=['region', 'Country'])
regions.drop_duplicates(subset='Country', inplace=True)
regions.reset_index(inplace=True, drop=True)
regions.set_index('Country', inplace=True)
regions.head()
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])
# Print the mean over the '2010' column of life_by_region
life_by_region[2010].mean()
life_by_region.mean()
life_by_region[[1964, 1970, 1980, 1990, 2000, 2010, 2013]].mean().plot(rot=45)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
sales
sales.groupby('city')[['bread','butter']].max()
sales.groupby('city')[['bread','butter']].agg(['max','sum'])
def data_range(series):
    return series.max() - series.min()
sales.groupby('weekday')[['bread', 'butter']].agg(data_range)
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':data_range})
The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.
In this exercise, you're going to group passengers on the Titanic by 'pclass' and aggregate the 'age' and 'fare' columns by the functions 'max' and 'median'. You'll then use multi-level selection to find the oldest passenger per class and the median fare price per class.
The DataFrame has been pre-loaded as titanic.
Instructions
# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')
# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]
# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])
print(aggregated)
# Print the maximum age in each class
print('\nMaximum Age:\n', aggregated.loc[:, ('age','max')])
# Print the median fare in each class
print('\nMedian Fare:\n', aggregated.loc[:, ('fare','median')])
If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.
In this exercise you'll use the full Gapminder dataset which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013.
Your job is to create a multi-level DataFrame of the columns 'Year', 'Region' and 'Country'. Next you'll group the DataFrame by the 'Year' and 'Region' levels. Finally, you'll apply a dictionary aggregation to compute the total population, spread of per capita GDP values and average child mortality rate.
The Gapminder CSV file is available as 'gapminder.csv'.
Instructions
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv(gapminder_data, index_col=['Year', 'region', 'Country'])
gapminder.sort_index(inplace=True)
gapminder.head()
# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year', 'region'])
# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()
# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}
# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)
aggregated.tail(6)
Groupby operations can also be performed on transformations of the index values. In the case of a DateTimeIndex, we can extract portions of the datetime over which to group.
In this exercise you'll read in a set of sample sales data from February 2015 and assign the 'Date' column as the index. Your job is to group the sales data by the day of the week and aggregate the sum of the 'Units' column.
Is there a day of the week that is more popular for customers? To find out, you're going to use .strftime('%a') to transform the index datetime values to abbreviated days of the week.
In the original exercise, the sales data CSV file is available as 'sales.csv'; here, the February 2015 sales data is constructed inline below.
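To see what .strftime('%a') does to a DatetimeIndex before using it for grouping, here is a quick, self-contained illustration (the dates are arbitrary):
dates = pd.to_datetime(['2015-02-02 08:30', '2015-02-03 14:00', '2015-02-07 23:00'])
print(dates.strftime('%a'))   # Mon, Tue, Sat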
Instructions
sales_values = np.array([['2015-02-02 08:30:00', 'Hooli', 'Software', 3],
['2015-02-02 21:00:00', 'Mediacore', 'Hardware', 9],
['2015-02-03 14:00:00', 'Initech', 'Software', 13],
['2015-02-04 15:30:00', 'Streeplex', 'Software', 13],
['2015-02-04 22:00:00', 'Acme Coporation', 'Hardware', 14],
['2015-02-05 02:00:00', 'Acme Coporation', 'Software', 19],
['2015-02-05 22:00:00', 'Hooli', 'Service', 10],
['2015-02-07 23:00:00', 'Acme Coporation', 'Hardware', 1],
['2015-02-09 09:00:00', 'Streeplex', 'Service', 19],
['2015-02-09 13:00:00', 'Mediacore', 'Software', 7],
['2015-02-11 20:00:00', 'Initech', 'Software', 7],
['2015-02-11 23:00:00', 'Hooli', 'Software', 4],
['2015-02-16 12:00:00', 'Hooli', 'Software', 10],
['2015-02-19 11:00:00', 'Mediacore', 'Hardware', 16],
['2015-02-19 16:00:00', 'Mediacore', 'Service', 10],
['2015-02-21 05:00:00', 'Mediacore', 'Software', 3],
['2015-02-21 20:30:00', 'Hooli', 'Hardware', 3],
['2015-02-25 00:30:00', 'Initech', 'Service', 10],
['2015-02-26 09:00:00', 'Streeplex', 'Service', 4]])
sales_cols = ['Date', 'Company', 'Product', 'Units']
sales = pd.DataFrame(sales_values, columns=sales_cols)
sales['Date'] = pd.to_datetime(sales['Date'])
sales.set_index('Date', inplace=True)
sales['Units'] = sales['Units'].astype('int64')
sales.head()
sales.info()
# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))
# Create sum: units_sum
units_sum = by_day['Units'].sum()
units_sum
def zscore_def(series):
    return (series - series.mean()) / series.std()
auto = pd.read_csv(auto_mpg)
auto.head()
zscore_def(auto['mpg']).head()
# imported from scipy at the top
zscore(auto['mpg'])[:5]
auto.groupby('yr')['mpg'].transform(zscore_def).head()
def zscore_with_year_and_name(group):
    df = pd.DataFrame(
        {'mpg': zscore(group['mpg']),
         'year': group['yr'],
         'name': group['name']})
    return df
auto.groupby('yr').apply(zscore_with_year_and_name).head()
As Dhavide demonstrated in the video using the zscore function, you can apply a .transform() method after grouping to apply a function to groups of data independently. The z-score is also useful to find outliers: a z-score value of +/- 3 is generally considered to be an outlier.
In this example, you're going to normalize the Gapminder data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, you will filter out countries that have high fertility rates and low life expectancy for their region.
The Gapminder DataFrame for 2010 indexed by 'Country' is provided for you as gapminder_2010.
Instructions
gapminder = pd.read_csv(gapminder_data, index_col='Country')
gapminder_mask = gapminder['Year'] == 2010
gapminder_2010 = gapminder[gapminder_mask].copy()
gapminder_2010.drop('Year', axis=1, inplace=True)
gapminder_2010.head()
# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)
standardized.head()
# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)
# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]
gm_outliers
Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). In Chapter 1, you practiced using the .dropna() method to drop missing values. Now, you will practice imputing missing values. You can use .groupby() and .transform() to fill missing data appropriately for each group.
Your job is to fill in missing 'age' values for passengers on the Titanic with the median age from their 'sex' and 'pclass'. To do this, you'll group by the 'sex' and 'pclass' columns and transform each group with a custom function to call .fillna() and impute the median value.
The DataFrame has been pre-loaded as titanic. Explore it in the IPython Shell by printing the output of titanic.tail(10). Notice in particular the NaNs in the 'age' column.
Instructions
titanic = pd.read_csv(titanic_data)
titanic.head()
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex', 'pclass'])
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
# Impute age and assign to titanic['age']
titanic.age = by_sex_class['age'].transform(impute_median)
titanic.tail(10)
The .apply() method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way.
In this exercise, you're going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you'll define a function to compute the aggregate spread of per capita GDP in each region and the individual country's z-score of the regional per capita GDP. You'll then select three countries - United States, Great Britain and China - to see a summary of the regional GDP and that country's z-score against the regional mean.
The 2010 Gapminder DataFrame is provided for you as gapminder_2010. Pandas has been imported as pd.
The following function has been defined for your use:
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)': z, 'regional spread(gdp)': s})
Instructions
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)': z, 'regional spread(gdp)': s})
gapminder_2010.head()
# Group gapminder_2010 by 'region' (group_keys=False keeps the Country index on the
# .apply() result, so the country lookup below works): regional
regional = gapminder_2010.groupby('region', group_keys=False)
# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)
# Print the disparity of 'United States', 'United Kingdom', and 'China'
reg_disp.loc[['United States', 'United Kingdom', 'China']]
auto.head()
auto.groupby('yr')['mpg'].mean()
splitting = auto.groupby('yr')
type(splitting)
type(splitting.groups)
print(splitting.groups.keys())
for group_name, group in splitting:
    avg = group['mpg'].mean()
    print(group_name, avg)
for group_name, group in splitting:
    avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
    print(group_name, avg)
chevy_means = {year:group.loc[group['name'].str.contains('chevrolet'),'mpg'].mean() for year,group in splitting}
pd.Series(chevy_means)
chevy = auto['name'].str.contains('chevrolet')
auto.groupby(['yr', chevy])['mpg'].mean()
By using .apply(), you can write functions that filter rows within groups. The .apply() method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.
In this exercise you'll take the Titanic data set and analyze survival rates from the 'C' deck, which contained the most passengers. To do this you'll group the dataset by 'sex' and then use the .apply() method on a provided user defined function which calculates the mean survival rates on the 'C' deck:
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
The DataFrame has been pre-loaded as titanic.
Instructions
titanic = pd.read_csv(titanic_data)
titanic.head(3)
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby('sex')
# Call by_sex.apply with the function c_deck_survival
c_surv_by_sex = by_sex.apply(c_deck_survival)
c_surv_by_sex.head()
You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.
In this exercise, you'll take the February sales data and remove entries from companies that purchased less than or equal to 35 Units in the whole month.
First, you'll identify how many units each company bought for verification. Next you'll use the .filter() method after grouping by 'Company' to remove all rows belonging to companies whose sum over the 'Units' column was less than or equal to 35. Finally, verify that the three companies whose total Units purchased were less than or equal to 35 have been filtered out from the DataFrame.
Instructions
sales_values = np.array([['2015-02-02 08:30:00', 'Hooli', 'Software', 3],
['2015-02-02 21:00:00', 'Mediacore', 'Hardware', 9],
['2015-02-03 14:00:00', 'Initech', 'Software', 13],
['2015-02-04 15:30:00', 'Streeplex', 'Software', 13],
['2015-02-04 22:00:00', 'Acme Coporation', 'Hardware', 14],
['2015-02-05 02:00:00', 'Acme Coporation', 'Software', 19],
['2015-02-05 22:00:00', 'Hooli', 'Service', 10],
['2015-02-07 23:00:00', 'Acme Coporation', 'Hardware', 1],
['2015-02-09 09:00:00', 'Streeplex', 'Service', 19],
['2015-02-09 13:00:00', 'Mediacore', 'Software', 7],
['2015-02-11 20:00:00', 'Initech', 'Software', 7],
['2015-02-11 23:00:00', 'Hooli', 'Software', 4],
['2015-02-16 12:00:00', 'Hooli', 'Software', 10],
['2015-02-19 11:00:00', 'Mediacore', 'Hardware', 16],
['2015-02-19 16:00:00', 'Mediacore', 'Service', 10],
['2015-02-21 05:00:00', 'Mediacore', 'Software', 3],
['2015-02-21 20:30:00', 'Hooli', 'Hardware', 3],
['2015-02-25 00:30:00', 'Initech', 'Service', 10],
['2015-02-26 09:00:00', 'Streeplex', 'Service', 4]])
sales_cols = ['Date', 'Company', 'Product', 'Units']
sales = pd.DataFrame(sales_values, columns=sales_cols)
sales['Date'] = pd.to_datetime(sales['Date'])
sales.set_index('Date', inplace=True)
sales['Units'] = sales['Units'].astype('int64')
sales.head()
# Group sales by 'Company': by_company
by_company = sales.groupby('Company')
# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
by_com_sum
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
by_com_filt
You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function/transformation of a column. The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.
In this exercise your job is to investigate survival rates of passengers on the Titanic by 'age' and 'pclass'. In particular, the goal is to find out what fraction of children under 10 survived in each 'pclass'. You'll do this by first creating a boolean array where True is passengers under 10 years old and False is passengers over 10. You'll use .map() to change these values to strings.
Finally, you'll group by the under 10 series and the 'pclass' column and aggregate the 'survived' column. The 'survived' column has the value 1 if the passenger survived and 0 otherwise. The mean of the 'survived' column is the fraction of passengers who lived.
The DataFrame has been pre-loaded for you as titanic.
Instructions
titanic.head(3)
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})
under10.head()
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
survived_mean_1
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
survived_mean_2
Here, you will bring together everything you have learned in this course while working with data recorded from the Summer Olympic games that goes as far back as 1896! This is a rich dataset that will allow you to fully apply the data manipulation techniques you have learned. You will pivot, unstack, group, slice, and reshape your data as you explore this dataset and uncover some truly fascinating insights. Enjoy!
medals = pd.read_csv(medals_data)
medals.head()
The Olympic medal data for the following exercises comes from The Guardian. It comprises records of all events held at the Olympic games between 1896 and 2012.
Suppose you have loaded the data into a DataFrame medals. You now want to find the total number of medals awarded to the USA per edition. To do this, filter the 'USA' rows and use the groupby() function to put the 'Edition' column on the index:
USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')
Given the goal of finding the total number of USA medals awarded per edition, what column should you select and which aggregation method should you use?
Instructions
Possible Answers
USA_edition_grouped['City'].mean()
USA_edition_grouped['Athlete'].sum()
USA_edition_grouped['Medal'].count()
USA_edition_grouped['Gender'].first()
USA_edition_grouped = medals.loc[medals.NOC == 'USA'].groupby('Edition')
USA_edition_grouped['Medal'].count().head()
For this exercise, you will use the pandas Series method .value_counts() to determine the top 15 countries ranked by total number of medals.
Notice that .value_counts() sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
The DataFrame has been pre-loaded for you as medals.
Instructions
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
country_names.head()
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
# Print top 15 countries ranked by medals
print(medal_counts.head(15))
Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail. You can use a pivot table to compute how many separate bronze, silver and gold medals each country won. That pivot table can then be used to repeat the previous computation to rank by total medals won.
In this exercise, you will use .pivot_table() first to aggregate the total medals by type. Then, you can use .sum() along the columns of the pivot table to produce a new column. When the modified pivot table is sorted by the total medals column, you can display the results from the last exercise with a bit more detail.
Instructions
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC',
columns='Medal',
values='Athlete',
aggfunc='count')
counted.head()
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
counted.head()
# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)
counted.head()
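As an aside, .pivot_table() can compute the totals for you via its margins arguments, adding an 'All' column and row (a sketch, not the exercise solution):
# Sketch: margins=True appends 'All' totals instead of summing manually
counted_margins = medals.pivot_table(index='NOC',
                                     columns='Medal',
                                     values='Athlete',
                                     aggfunc='count',
                                     margins=True,
                                     margins_name='All')
counted_margins.head()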
medals.loc[145:154, ['NOC', 'Gender', 'Event', 'Event_gender', 'Medal']]
What could be the difference between the 'Event_gender' and 'Gender' columns? You should be able to evaluate your guess by looking at the unique values of the pairs (Event_gender, Gender) in the data. In particular, you should not see something like (Event_gender='M', Gender='Women'). However, you will see that, strangely enough, there is an observation with (Event_gender='W', Gender='Men').
The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations. The DataFrame has been loaded as medals.
Instructions
medals = pd.read_csv(medals_data)
medals.head()
# Select columns: ev_gen
ev_gen = medals[['Event_gender', 'Gender']]
ev_gen.head()
# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
ev_gen_uniques
You will now use .groupby() to continue your exploration. Your job is to group by 'Event_gender' and 'Gender' and count the rows.
You will see that there is only one suspicious row; it is likely a data error.
The DataFrame is available to you as medals.
Instructions
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender', 'Gender'])
# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()
medal_count_by_gender
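If you only need one count per group, .size() returns a Series rather than counting every column separately (a minimal sketch):
# Sketch: .size() gives a single row count per (Event_gender, Gender) pair
medals.groupby(['Event_gender', 'Gender']).size()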
You will now inspect the suspect record by locating the offending row.
You will see that, according to the data, Joyce Chepchumba was a man who won a medal in a women's event. That is a data error, as you can confirm with a web search.
Instructions
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')
# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus]
suspect
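If you wanted to correct the record rather than just flag it, you could overwrite the 'Gender' entry on a copy (a sketch that assumes the 'W' event gender is the accurate value):
# Sketch: fix the mislabelled 'Gender' value for the suspect row(s)
medals_fixed = medals.copy()
medals_fixed.loc[sus, 'Gender'] = 'Women'
medals_fixed.loc[sus]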
sports = medals['Sport'].unique()
sports
weather = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
'Mean TemperatureF': [32.354839, 28.714286, 35.000000, 53.100000, 62.612903, 70.133333,
72.870968, 70.000000, 63.766667, 55.451613, 39.800000, 34.935484]})
weather.set_index('Month', inplace=True)
weather
weather.idxmax() # Returns month of highest temperature
weather.T # Returns DataFrame with single row, 12 columns
weather.T.idxmax(axis='columns')
weather.T.idxmin(axis='columns')
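Note that for a single column, calling .idxmax() or .idxmin() on the Series avoids the transpose entirely (a minimal sketch):
# Sketch: Series.idxmax()/idxmin() return the index labels directly
weather['Mean TemperatureF'].idxmax()  # 'Jul'
weather['Mean TemperatureF'].idxmin()  # 'Feb'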
You may want to know which countries won medals in the most distinct sports. The .nunique() method is the principal aggregation here. Given a Series S, S.nunique() returns the number of distinct values it contains.
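For a quick illustration, consider a throwaway Series with made-up values, unrelated to the medals data:
# Sketch: .nunique() counts distinct values, ignoring repeats
s = pd.Series(['Swimming', 'Rowing', 'Swimming', 'Fencing'])
s.nunique()  # 3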
Instructions
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')
# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()
# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)
# Print the top 15 rows of Nsports
print(Nsports.head(15))
The USSR is not in the top 5 in this category, while the USA remains on top. What could be the cause of this? You'll compare the medal counts of the USA vs. the USSR more closely in the next two exercises to find out.
The Olympic competitions between 1952 and 1988 took place during the height of the Cold War between the United States of America (USA) & the Union of Soviet Socialist Republics (USSR). Your goal in this exercise is to aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years.
The construction is mostly the same as in the preceding exercise. There is an additional filtering stage beforehand in which you reduce the original DataFrame medals by extracting data from the Cold War period that applies only to the US or to the USSR. The relevant country codes in the DataFrame, which has been pre-loaded as medals, are 'USA' & 'URS'.
Instructions
# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (1952 <= medals.Edition) & (medals.Edition <= 1988)
# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])
# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]
# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')
# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)
Nsports
As you can see, the USSR actually won medals in more distinct sports than the USA when you look only at the Olympic editions between 1952 and 1988.
For this exercise, you want to see which country, the USA or the USSR, won the most medals consistently over the Cold War period.
There are several steps involved: build a pivot table of medal counts by 'Edition' and 'NOC', slice out the Cold War editions for 'USA' and 'URS', use .idxmax(axis='columns') to find the top country in each edition, and count how often each country comes out on top with .value_counts().
Instructions
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition',
columns='NOC',
values='Athlete',
aggfunc='count')
medals_won_by_country.head()
# Slice medals_won_by_country: cold_war_usa_urs_medals
cold_war_usa_urs_medals = medals_won_by_country.loc[1952:1988, ['USA', 'URS']]
# Create most_medals
most_medals = cold_war_usa_urs_medals.idxmax(axis='columns')
# Print most_medals.value_counts()
most_medals.value_counts()
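As a quick cross-check, you could also compare the overall Cold War totals per country using the slice built above (a sketch):
# Sketch: sum each country's medals across the Cold War editions
cold_war_usa_urs_medals.sum()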
medals = pd.read_csv(medals_data)
all_medals = medals.groupby('Edition')['Athlete'].count()
all_medals.head(6) # Series for all medals, all years
all_medals.plot(kind='line', marker='.')
plt.show()
france = medals.NOC == 'FRA' # Boolean Series for France
france_grps = medals[france].groupby(['Edition', 'Medal'])
france_grps['Athlete'].count().head(10)
france_medals = france_grps['Athlete'].count().unstack()
france_medals.head(12) # 'Edition' is now a single-level index; medal types are columns
france_medals.plot(kind='line', marker='.')
plt.show()
Your job in this exercise is to visualize the medal counts by 'Edition' for the USA. The DataFrame has been pre-loaded for you as medals.
Instructions
medals.head()
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
usa.head()
# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
usa_medals_by_year.head()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()
As in the previous exercise, your job in this exercise is to visualize the medal counts by 'Edition' for the USA. This time, you will use an area plot to see the breakdown better. The usa DataFrame has been created and all reshaping from the previous exercise has been done. You need to write the plotting command.
Instructions
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by ['Edition', 'Medal'] and count 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()
usa_medals_by_year.head()
medals.info()
You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver. However, you would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold.
You can achieve this using Categorical types. In this final exercise, after redefining the 'Medal' column of the DataFrame medals, you will repeat the area plot from the previous exercise to see the new ordering.
Instructions
# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal,
categories=['Bronze', 'Silver', 'Gold'],
ordered=True)
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by ['Edition', 'Medal'] and count 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()
medals.head(3)
medals.info()
usa_medals_by_year.head()
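With the ordered categorical in place, sorting and comparisons now follow the Olympic ordering rather than the alphabetical one (a small sketch):
# Sketch: the ordered dtype sorts Bronze < Silver < Gold
medals.Medal.cat.categories
medals.Medal.sort_values().tail(3)  # Gold rows come last under the new ordering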
You can now…
import sys
# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']
# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_')
and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)