
Titanic Missing Data Imputation Comparison

Fill nan values with the groupby mean

import pandas as pd
import seaborn as sns

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# fill nan ages with the mean age of the matching pclass/sex group
df['Age_Fill'] = df.groupby(['pclass', 'sex'])['age'].transform(lambda x: x.fillna(x.mean()))

# series with filled ages
groupby_result = df.Age_Fill[df.age.isnull()]

# display(df[df.age.isnull()].head())
 survived  pclass     sex  age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone  Age_Fill
        0       3    male  NaN      0      0   8.4583        Q   Third    man        True  NaN   Queenstown    no   True  26.50759
        1       2    male  NaN      0      0  13.0000        S  Second    man        True  NaN  Southampton   yes   True  30.74071
        1       3  female  NaN      0      0   7.2250        C   Third  woman       False  NaN    Cherbourg   yes   True  21.75000
        0       3    male  NaN      0      0   7.2250        C   Third    man        True  NaN    Cherbourg    no   True  26.50759
        1       3  female  NaN      0      0   7.8792        Q   Third  woman       False  NaN   Queenstown   yes   True  21.75000
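
For reference, the per-group means that this fill uses can be inspected directly; this is a small check added here, not part of the original snippet:

# mean age of each pclass/sex group (nan values are ignored by mean)
print(df.groupby(['pclass', 'sex'])['age'].mean())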

Impute nan values with RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import seaborn as sns

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# split data
train = df.loc[(df.age.notnull())]  # known age values
test = df.loc[(df.age.isnull())]  # all nan age values

# select age column
y = train.values[:, 3]

# select pclass and sex
X = train.values[:, [1, 2]]

# create RandomForestRegressor model
rfr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)

# Fit a model
rfr.fit(X, y)

# Use the fitted model to predict the missing values
predictedAges = rfr.predict(test.values[:, [1, 2]])

# create predicted age column
df['pred_age'] = df.age

# fill column
df.loc[(df.pred_age.isnull()), 'pred_age'] = predictedAges 

# display(df[df.age.isnull()].head())
 survived  pclass  sex  age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone  pred_age
        0       3    1  NaN      0      0   8.4583        Q   Third    man        True  NaN   Queenstown    no   True  26.49935
        1       2    1  NaN      0      0  13.0000        S  Second    man        True  NaN  Southampton   yes   True  30.73126
        1       3    0  NaN      0      0   7.2250        C   Third  woman       False  NaN    Cherbourg   yes   True  21.76513
        0       3    1  NaN      0      0   7.2250        C   Third    man        True  NaN    Cherbourg    no   True  26.49935
        1       3    0  NaN      0      0   7.8792        Q   Third  woman       False  NaN   Queenstown   yes   True  21.76513
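
Because the regressor only sees pclass and sex, it can produce at most six distinct predictions, one per (pclass, sex) combination, which is why the values above repeat. A quick sanity check on the array returned by predict (a minimal sketch):

import numpy as np

# distinct predicted ages; expect one value per pclass/sex group
print(np.unique(predictedAges.round(5)))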

Comparison of groupby and RandomForestRegressor

print((predictedAges - groupby_result).describe())

count    177.00000
mean       0.00362
std        0.01877
min       -0.04167
25%        0.01121
50%        0.01121
75%        0.01131
max        0.02969
Name: Age_Fill, dtype: float64

# comparison dataframe
comp = pd.DataFrame({'rfr': predictedAges.tolist(), 'gb': groupby_result.tolist()})
comp['diff'] = comp.rfr - comp.gb

# display(comp)
      rfr        gb     diff
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 28.75266  28.72297  0.02969
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 28.75266  28.72297  0.02969
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
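
The differences are within a few hundredths of a year: with only pclass and sex as features, the forest effectively learns the per-group mean age. To summarize the comparison per group rather than row by row, the differences can be aggregated; a small sketch using the comp dataframe from above:

# mean rfr-vs-groupby difference and row count for each group mean
print(comp.groupby('gb')['diff'].agg(['mean', 'count']))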

Calculate means on a random training set

  • This example calculates the group means on a random training set, and then uses them to fill the nan values in both the training set and the test set.
  • Filling uses fillna, which aligns the fill values on the index; pclass and sex are therefore set as the index on both frames, so the means are matched on the pclass/sex labels rather than on row position (see the small illustration after this list).
  • In this example, train is 67% of the data, and test is 33% of the data.
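
A tiny illustration of how index-aligned fillna behaves, using made-up values rather than the Titanic data:

import pandas as pd

# series with a missing value, indexed by a group label
s = pd.Series([20.0, None], index=['a', 'b'])

# fill values keyed by the same labels
fill = pd.Series({'b': 30.0})

# 'b' is filled with 30.0 because the index labels align
print(s.fillna(fill))

With that in mind, the full example: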
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# select columns for X and y
X = df[['pclass', 'sex']]
y = df['age']

# randomly split the dataframe into a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# create a dataframe of train (X, y) and test (X, y)
train = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
test = pd.concat([X_test, y_test], axis=1).reset_index(drop=True)

# calculate means for train
train_means = train.groupby(['pclass', 'sex']).agg({'age': 'mean'})

# display train_means, a multi-index dataframe
                 age
pclass sex          
1      0    34.66667
       1    41.38710
2      0    27.90217
       1    30.50000
3      0    21.56338
       1    26.87163

# fill nan values in train
train = train.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()

# fill nan values in test
test = test.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()
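
As a quick check that both splits were filled from the train means, the remaining nan counts can be printed; a small verification step, not part of the original code:

# expected to be 0 for both, assuming every pclass/sex group has observed ages in the training split
print(train.age.isnull().sum(), test.age.isnull().sum())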