The course uses scikit-learn v0.19.2, pandas v0.19.2, and numpy v1.17.4; this notebook uses v0.24.1, v1.2.3, and v1.19.2 respectively, so there are differences in model performance compared to the course.
A helper function (create_dir_save_file) is used to automatically download and save the required data (data/2020-10-14_supervised_learning_sklearn) and image (Images/2020-10-14_supervised_learning_sklearn) files.
Machine learning is the field that teaches machines and computers to learn from existing data to make predictions on new data: Will a tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam? In this course, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. You'll learn how to build predictive models, tune their parameters, and determine how well they will perform with unseen data, all while using real-world datasets. You'll be using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.
import pandas as pd
import numpy as np
from pprint import pprint as pp
from itertools import combinations
import requests
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import randint
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_iris, load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression, ElasticNet
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report, roc_curve, precision_recall_curve, roc_auc_score, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import scale, StandardScaler
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
# plt.style.use('ggplot')
plt.rcParams["patch.force_edgecolor"] = True
def create_dir_save_file(dir_path: Path, url: str):
    """
    Check if the path exists and create it if it does not.
    Check if the file exists and download it if it does not.
    """
    if not dir_path.parents[0].exists():
        dir_path.parents[0].mkdir(parents=True)
        print(f'Directory Created: {dir_path.parents[0]}')
    else:
        print('Directory Exists')
    if not dir_path.exists():
        r = requests.get(url, allow_redirects=True)
        # write the downloaded bytes and close the file handle afterwards
        with open(dir_path, 'wb') as f:
            f.write(r.content)
        print(f'File Created: {dir_path.name}')
    else:
        print('File Exists')
data_dir = Path('data/2020-10-14_supervised_learning_sklearn')
images_dir = Path('Images/2020-10-14_supervised_learning_sklearn')
file_mpg = 'https://assets.datacamp.com/production/repositories/628/datasets/3781d588cf7b04b1e376c7e9dda489b3e6c7465b/auto.csv'
file_housing = 'https://assets.datacamp.com/production/repositories/628/datasets/021d4b9e98d0f9941e7bfc932a5787b362fafe3b/boston.csv'
file_diabetes = 'https://assets.datacamp.com/production/repositories/628/datasets/444cdbf175d5fbf564b564bd36ac21740627a834/diabetes.csv'
file_gapminder = 'https://assets.datacamp.com/production/repositories/628/datasets/a7e65287ebb197b1267b5042955f27502ec65f31/gm_2008_region.csv'
file_voting = 'https://assets.datacamp.com/production/repositories/628/datasets/35a8c54b79d559145bbeb5582de7a6169c703136/house-votes-84.csv'
file_wwine = 'https://assets.datacamp.com/production/repositories/628/datasets/2d9076606fb074c66420a36e06d7c7bc605459d4/white-wine.csv'
file_rwine = 'https://assets.datacamp.com/production/repositories/628/datasets/013936d2700e2d00207ec42100d448c23692eb6f/winequality-red.csv'
datasets = [file_mpg, file_housing, file_diabetes, file_gapminder, file_voting, file_wwine, file_rwine]
data_paths = list()
for data in datasets:
    file_name = data.split('/')[-1].replace('?raw=true', '')
    data_path = data_dir / file_name
    create_dir_save_file(data_path, data)
    data_paths.append(data_path)
In this chapter, you will be introduced to classification problems and learn how to solve them using supervised learning techniques. And you'll apply what you learn to a political dataset, where you classify the party affiliation of United States congressmen based on their voting records.
What is machine learning?
Unsupervised learning
Reinforcement learning
Supervised learning
| | Predictor Variables | Target |
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---:|---------------:|--------------:|---------------:|--------------:|:----------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
'click' or 'no click', 'spam' or 'not spam', or different species of flowers, we call the learning task classification.
Supervised learning in python
scikit-learn, or sklearn, one of the most popular and user-friendly machine learning libraries for Python.
SciPy stack, including libraries such as NumPy.
TensorFlow and keras, which are well worth checking out, once you get the basics down.
Naming conventions
Once you decide to leverage supervised machine learning to solve a new problem, you need to identify whether your problem is better suited to classification or regression. This exercise will help you develop your intuition for distinguishing between the two.
Provided below are 4 example applications of machine learning. Which of them is a supervised classification problem?
Answer the question
iris = load_iris()
print(f'Type: {type(iris)}')
print(f'Keys: {iris.keys()}')
print(f'Data Type: {type(iris.data)}\nTarget Type: {type(iris.target)}')
print(f'Data Shape: {iris.data.shape}')
print(f'Target Names: {iris.target_names}')
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
df['label'] = y
species_map = dict(zip(range(3), iris.target_names))
df['species'] = df.label.map(species_map)
df = df.reindex(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'species', 'label'], axis=1)
display(df.head())
# pd.plotting.scatter_matrix(df, c=y, figsize=(12, 10))
sns.pairplot(df.iloc[:, :5], hue='species', corner=True)
plt.show()
In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself - including on this very same dataset!
Before thinking about what supervised learning models you can apply to this, however, you need to perform Exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1).
Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called df. Use pandas' .head(), .info(), and .describe() methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.
cols = ['party', 'infants', 'water', 'budget', 'physician', 'salvador', 'religious', 'satellite', 'aid',
'missile', 'immigration', 'synfuels', 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
votes = pd.read_csv(data_paths[4], header=None, names=cols)
votes.iloc[:, 1:] = votes.iloc[:, 1:].replace({'?': None, 'n': 0, 'y': 1})
votes.head()
Possible Answers
435 rows and 17 columns.
'party', all of the columns are of type int64.
'party' is the target variable.
'party'.
The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as seaborn.countplot.
Given on the right is a countplot of the 'education' bill, generated from the following code:
plt.figure()
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.
It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not if we are dealing with hundreds of samples!
In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: Of these two bills, which do Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with plt.figure() so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.
# in order to use catplot, the dataframe needs to be in a tidy format
vl = votes.set_index('party').stack().reset_index().rename(columns={'level_1': 'cat', 0: 'vote'})
g = sns.catplot(data=vl, x='vote', col='cat', col_wrap=4, hue='party', kind='count', height=3, palette='RdBu')
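The "tidy format" mentioned in the comment above means one row per (party, bill, vote) observation. The sketch below shows that reshaping on a tiny made-up DataFrame (the names wide, long_stack, and long_melt and the values are illustrative only, not taken from the voting data). It also shows DataFrame.melt as an alternative that should give the same rows in a different order.

```python
import pandas as pd

# made-up wide table: one row per member, one column per bill
wide = pd.DataFrame({'party': ['democrat', 'republican', 'democrat'],
                     'education': [0, 1, 0],
                     'missile': [1, 0, 1]})

# same pattern as above: stack the bill columns into rows
long_stack = (wide.set_index('party')
                  .stack()
                  .reset_index()
                  .rename(columns={'level_1': 'cat', 0: 'vote'}))

# alternative: melt produces the same rows, ordered bill by bill
long_melt = wide.melt(id_vars='party', var_name='cat', value_name='vote')

print(long_stack)
```

Either long table can be passed to sns.catplot with x='vote', col='cat', and hue='party'.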
Possible Answers
'satellite'.
'missile'.
Both 'satellite' and 'missile'.
Neither 'satellite' nor 'missile'.
k-Nearest Neighbors (KNN)
k=3, you would classify it as red
k=5, you would classify it as green
KNN: Intuition
scikit-learn fit and predict
scikit-learn are implemented as python classes
scikit-learn we use the .fit() method to do this.
.predict() is used to predict the label of an unlabeled data point.
# instantiate model
knn = KNeighborsClassifier(n_neighbors=6)
# predict for 'petal length (cm)' and 'petal width (cm)'
knn.fit(df.iloc[:, 2:4], df.label)
h = .02 # step size in the mesh
# create colormap for the contour plot
cmap_light = ListedColormap(list(sns.color_palette('pastel', n_colors=3)))
# Plot the decision boundary.
# For that, we will assign a color to each point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = df['petal length (cm)'].min() - 1, df['petal length (cm)'].max() + 1
y_min, y_max = df['petal width (cm)'].min() - 1, df['petal width (cm)'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# create plot
fig, ax = plt.subplots()
# add data points
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='species', ax=ax, edgecolor='k')
# add decision boundary contour map
ax.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.4)
# legend
lgd = plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.show()
plt.close()
scikit-learn to fit a classifier
from sklearn.neighbors import KNeighborsClassifier
pandas.DataFrame or as a numpy.array
'male' or 'female'.
# new data
X_new = np.array([[5.6, 2.8, 3.9, 1.1], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]])
fig, ax = plt.subplots()
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='species', ax=ax, edgecolor='k')
sns.scatterplot(x=X_new[:, 2], y=X_new[:, 3], ax=ax, color='magenta', label='uncategorized', s=70)
plt.show()
# instantiate the model, and set the number of neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# fit the model to the training set, the labeled data
knn.fit(df.iloc[:, :4], df.label)
# predict the label of the new data
pred = knn.predict(X_new)
species_pred = list(map(species_map.get, pred))
print(f'Predicted Label: {pred}\nSpecies: {species_pred}')
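To make the majority-vote idea behind k-NN concrete, here is a minimal NumPy sketch of predicting a single point. It is not how scikit-learn implements KNeighborsClassifier; the function knn_predict_one and the toy arrays are hypothetical and exist only for illustration. It assumes Euclidean distance.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy, made-up data: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_first = knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
label_near_second = knn_predict_one(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(label_near_first, label_near_second)
```

With larger k, more neighbors take part in the vote, which is why the decision boundary becomes smoother.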
Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.
In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.
Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.
Instructions
Import KNeighborsClassifier from sklearn.neighbors.
Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.
Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter.
Fit the classifier to the data using the .fit() method.
v_na = votes.dropna().reset_index(drop=True)
v_na.head()
# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(v_na.iloc[:, 1:], v_na.party)
Now that your k-NN classifier with 6 neighbors has been fit to the data, it can be used to predict the labels of new data points.
Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.
In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.
The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.
Instructions
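Create arrays for the features and the target variable from df. As a reminder, the target variable is 'party'.
Instantiate a KNeighborsClassifier with 6 neighbors.
Predict the labels of the training data, X.
Predict the label of the new data point X_new.
X_new = np.array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897, 0.42310646, 0.9807642 , 0.68482974,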
0.4809319 , 0.39211752, 0.34317802, 0.72904971, 0.43857224, 0.0596779 , 0.39804426, 0.73799541]])
# Create arrays for the features and the response variable
y = v_na.party
X = v_na.iloc[:, 1:]
# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(X, y)
# Predict the labels for the training data X
y_pred = knn.predict(X)
# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print(f'Prediction: {new_prediction}')
Did your model predict 'democrat' or 'republican'? How sure can you be of its predictions? In other words, how can you measure its performance? This is what you will learn in the next video.
Train Test Split
sklearn.model_selection.train_test_split
random_state sets a seed for the random number generator that splits the data into train and test, which allows for reproducing the exact split of the data.
test_size.
stratify=y, where y is the array or dataframe of labels.
Model complexity and over / underfitting
K in a KNN model.
As K increases, the decision boundary gets smoother and less curvy.
K.
K even more, and make the model even simpler, then the model will perform less well on both test and training sets, as indicated in the following schematic figure, known as a model complexity curve.
df.head()
# from sklearn.model_selection import train_test_split
# split the data
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :4], df.species, test_size=0.3, random_state=21, stratify=df.species)
# instantiate the classifier
knn = KNeighborsClassifier(n_neighbors=8)
# fit it to the training data
knn.fit(X_train, y_train)
# make predictions on the test data
y_pred = knn.predict(X_test)
# check the accuracy using the score method of the model
score = knn.score(X_test, y_test)
# print the predictions and score
print(f'Test set score: {score:0.3f}\nTest set predictions:\n{y_pred}')
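As a quick check of what stratify does, the sketch below repeats the split on the Iris data and counts the labels on each side. The variable names (X_iris, train_counts, and so on) are chosen for this example only. Because Iris has 50 samples per class, a stratified 70/30 split should keep the three classes balanced in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# same settings as the split above: 30% test, fixed seed, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=21, stratify=y_iris)

# number of samples per class in each split
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(f'Train counts: {train_counts}\nTest counts: {test_counts}')
```

Without stratify, the class counts in the test set would generally differ from one class to the next.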
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.
Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.
Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].
For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.
Instructions
Import datasets from sklearn and matplotlib.pyplot as plt.
Load the digits dataset using the .load_digits() method on datasets.
Print the keys and DESCR of digits.
Print the shape of the images and data keys using the . notation.
Display digit 1010 using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!
# Load the digits dataset: digits
digits = load_digits()
# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits.DESCR)
# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)
# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
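The relationship between the 'images' and 'data' keys described above can be verified directly. This short sketch uses names introduced only for illustration (digits_demo, flattened). It confirms that both access styles return the same array and that flattening each 8x8 image reproduces the 64-column feature array.

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()

# attribute access and key access return the same array
same_array = digits_demo.images is digits_demo['images']

# flatten each 8x8 image into a row of 64 pixel values
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)
matches_data = np.array_equal(flattened, digits_demo.data)

print(digits_demo.images.shape, flattened.shape, same_array, matches_data)
```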
It looks like the image in question corresponds to the digit '5'. Now, can you build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset? You'll do so in the next exercise!
Now that you have learned about the importance of splitting your data into training and test sets, it's time to practice doing this on the digits dataset! After creating arrays for the features and target variable, you will split them into training and test sets, fit a k-NN classifier to the training data, and then compute its accuracy using the .score() method.
Instructions
Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection.
Create an array for the features using digits.data and an array for the target using digits.target.
Create stratified training and test sets using 0.2 for the size of the test set. Use a random state of 42. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset.
Create a k-NN classifier with 7 neighbors and fit it to the training data.
Compute and print the accuracy of the classifier's predictions using the .score() method.
# Create feature and target arrays
X = digits.data
y = digits.target
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# predict
pred = knn.predict(X_test)
result = list(zip(pred, y_test))
not_correct = [v for v in result if v[0] != v[1]]
num_correct = len(result) - len(not_correct)
# Print the accuracy
score = knn.score(X_test, y_test)
print(f'Incorrect Result: {not_correct}\nNumber Correct: {num_correct}\nScore: {score:0.2f}')
Incredibly, this out of the box k-NN classifier with 7 neighbors has learned from the training data and predicted the labels of the images in the test set with 98% accuracy, and it did so in less than a second! This is one illustration of how incredibly useful machine learning techniques can be.
Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.
The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.
Instructions
Set up a k-NN classifier with the number of neighbors equal to k.
Fit the classifier with k neighbors to the training data.
Compute accuracy scores on the training set and test set separately using the .score() method and assign the results to the train_accuracy and test_accuracy arrays respectively.
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('KNN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. Now that you've grasped the fundamentals of classification, you will learn about regression in the next chapter!
In the previous chapter, you used image and political datasets to predict binary and multiclass outcomes. But what if your problem requires a continuous outcome? Regression is best suited to solving such problems. You will learn about fundamental concepts in regression and apply them to predict the life expectancy in a given country using Gapminder data.
'CRIM' is per capita crime rate
'NX' is nitric oxides concentration
'RM' is average number of rooms per dwelling
'MEDV' is the median value of owner occupied homes in thousands of dollars
Creating feature and target arrays
features and target values in distinct arrays, X and y.
.values attribute returns the NumPy arrays.
pandas documentation recommends using .to_numpy
Predicting house value from a single feature
'RM'
.reshape method to keep the first dimension, but add another dimension of size one to X.
Fitting a regression model
sklearn.linear_model.LinearRegression
np.linspace between the max and min value of X_rooms.
boston = pd.read_csv(data_paths[1])
display(boston.head())
# creating features and target arrays
X = boston.drop('MEDV', axis=1).to_numpy()
y = boston.MEDV.to_numpy()
# predict from a single feature
X_rooms = X[:, 5]
# check variable type
print(f'X_rooms type: {type(X_rooms)}, shape: {X_rooms.shape}\ny type: {type(y)}, shape: {y.shape}')
# reshape
X_rooms = X_rooms.reshape(-1, 1)
y = y.reshape(-1, 1)
print(f'X_rooms shape: {X_rooms.shape}\ny shape: {y.shape}')
# instantiate model
reg = LinearRegression()
# fit a linear model
reg.fit(X_rooms, y)
# data range variable
pred_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)
# plot house value as a function of rooms
sns.scatterplot(data=boston, x='RM', y='MEDV', label='Data')
plt.plot(pred_space, reg.predict(pred_space), color='k', lw=3, label='Regression')
plt.legend(loc='lower right')
plt.xlabel('Number of Rooms')
plt.ylabel('Value of house /1000 ($)')
plt.show()
Andy introduced regression to you using the Boston housing dataset. But regression models can be used in a variety of contexts to solve a variety of different problems.
Given below are four example applications of machine learning. Your job is to pick the one that is best framed as a regression problem.
Answer the question
In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.
Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.
Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.
Instructions
Import numpy and pandas as their standard aliases.
Read the file 'gapminder.csv' into a DataFrame df using the read_csv() function.
Create array X for the 'fertility' feature and array y for the 'life' target variable.
Reshape the arrays by using the .reshape() method and passing in -1 and 1.
# Read the CSV file into a DataFrame: df
df = pd.read_csv(data_paths[3])
# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values
# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))
# Reshape X and y
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)
# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))
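To see what .reshape(-1, 1) does in isolation, here is a small sketch on a made-up array (the values and the names values and column are illustrative only). The -1 asks NumPy to infer that dimension from the array's length, and the 1 adds a second dimension of size one. The result is the 2D (n_samples, n_features) shape that scikit-learn estimators expect for a feature array.

```python
import numpy as np

# made-up 1D array standing in for a single feature column
values = np.array([2.73, 6.43, 2.24, 1.40])

# -1 lets NumPy infer the number of rows; 1 adds a single column
column = values.reshape(-1, 1)

print(f'Before: {values.shape}\nAfter: {column.shape}')
```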
As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?
Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().
In case you are curious, the heatmap was generated using Seaborn's heatmap function and the following line of code, where df.corr() computes the pairwise correlation between columns:
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!
Instructions
139 samples (or rows) and 9 columns.
life and fertility are negatively correlated.
life is 69.602878.
fertility is of type int64.
GDP and life are positively correlated.
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
Regression mechanics
The loss function
When .fit is called on a linear regression model in scikit-learn, it performs this OLS under the hood.
Linear regression in higher dimensions
.fit method, one containing the features, the other is the target variable.
Linear regression on all Boston Housing features
boston.head()
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)
# instantiate the regressor
reg_all = LinearRegression()
# fit on the training set
reg_all.fit(X_train, y_train)
# predict on the test set
y_pred = reg_all.predict(X_test)
# score the model
score = reg_all.score(X_test, y_test)
print(f'Model Score: {score:0.3f}')
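Ordinary least squares (OLS) chooses the coefficients that minimize the sum of squared residuals. As a rough illustration, the sketch below solves that least-squares problem directly with np.linalg.lstsq on synthetic data and compares the result with LinearRegression. The data, the seed, and names such as X_demo and beta are made up for this example. It is not a description of scikit-learn's internal code path, only a check that the two approaches agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: 3 features, known coefficients, small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.1, size=100)

# OLS by hand: add a column of ones for the intercept,
# then find the beta that minimizes the sum of squared residuals
A = np.column_stack([X_demo, np.ones(len(X_demo))])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# the same fit using scikit-learn
reg_demo = LinearRegression().fit(X_demo, y_demo)

coef_match = np.allclose(beta[:-1], reg_demo.coef_)
intercept_match = np.isclose(beta[-1], reg_demo.intercept_)
print(f'Coefficients match: {coef_match}\nIntercept matches: {intercept_match}')
```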
Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.
A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the $R^2$ score using scikit-learn's .score() method.
Instructions
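Import LinearRegression from sklearn.linear_model.
Create a LinearRegression regressor called reg.
Set up the prediction space to range from the minimum to the maximum of X_fertility. This has been done for you.
Fit the regressor to the data (X_fertility and y) and compute its predictions using the .predict() method and the prediction_space array.
Compute and print the $R^2$ score using the .score() method.
df.head()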
X_fertility = df.fertility.to_numpy().reshape(-1, 1)
y = df.life.to_numpy().reshape(-1, 1)
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(df.fertility.max(), df.fertility.min()).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
# Print R^2
score = reg.score(X_fertility, y)
print(f'Score: {score}')
# Plot regression line
sns.scatterplot(data=df, x='fertility', y='life')
plt.xlabel('Fertility')
plt.ylabel('Life Expectancy')
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
Notice how the line captures the underlying trend in the data. And the performance is quite decent for this basic regression model with only one feature.
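The $R^2$ value returned by .score() can be reproduced by hand, which makes it clearer what the number measures. The following is a minimal sketch on synthetic data (an assumption made for illustration, not the Gapminder file); the names X_demo, y_demo, and reg_demo are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a single-feature dataset (not the Gapminder data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(1, 8, size=(100, 1))
y_demo = 80 - 4 * X_demo[:, 0] + rng.normal(0, 3, size=100)

reg_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = reg_demo.predict(X_demo)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'Manual R^2:   {r2_manual:0.3f}')
print(f'.score() R^2: {reg_demo.score(X_demo, y_demo):0.3f}')
```

Both printed values should agree, since .score() for regressors computes this same quantity.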
As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.
In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the $R^2$ score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X
and target variable array y
have been pre-loaded for you from the DataFrame df
.
Instructions
Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.
Compute and print the $R^2$ score using the .score() method on the test set.
Compute and print the RMSE. To do this, first compute the mean squared error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().
X = df.drop(['life', 'Region'], axis=1).to_numpy()
y = df.life.to_numpy()
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# Create the regressor: reg_all
reg_all = LinearRegression()
# Fit the regressor to the training data
reg_all.fit(X_train, y_train)
# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)
# Compute and print R^2 and RMSE
print(f"R^2: {reg_all.score(X_test, y_test):0.3f}")
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:0.3f}")
Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this, as well as how to better validate your models, in the next section.
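One way to see the pitfall is to repeat the split with different random states: the test-set $R^2$ depends on which rows happen to land in the test set. The sketch below assumes synthetic data from make_regression rather than the Gapminder DataFrame, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, not the course dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=25.0, random_state=0)

split_scores = []
for seed in range(5):
    # Same model, same data, only the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(f'R^2 per split: {np.round(split_scores, 3)}')
```

A single train/test split therefore gives only one sample of the model's performance, which is the motivation for cross-validation.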
Cross-validation in scikit-learn
sklearn.model_selection.cross_val_score
cv_results
cv
parameter.mean
# instantiate the model
reg = LinearRegression()
# call cross_val_score
cv_results = cross_val_score(reg, boston.drop('MEDV', axis=1), boston.MEDV, cv=5)
print(f'Scores: {np.round(cv_results, 3)}')
print(f'Scores mean: {np.round(np.mean(cv_results), 3)}')
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.
In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's cross_val_score()
function uses $R^2$ as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.
The DataFrame has been loaded as df
and split into the feature/target variable arrays X
and y
. The modules pandas
and numpy
have been imported as pd
and np
, respectively.
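Instructions
Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
Create a linear regression regressor called reg.
Use the cross_val_score() function to perform 5-fold cross-validation on X and y.
Compute and print the average cross-validation score. You can use NumPy's mean() function to compute the average.
# Create a linear regression object: reg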
reg = LinearRegression()
# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)
# Print the 5-fold cross-validation scores
print(f'Scores: {np.round(cv_scores, 3)}')
print(f'Scores mean: {np.round(np.mean(cv_scores), 3)}')
Now that you have cross-validated your model, you can more confidently evaluate its predictions.
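To make the mechanics concrete, here is a rough sketch of what cross_val_score() does for a regressor when cv=5: it splits the rows into five consecutive folds, fits on four, and scores on the fifth. The data is synthetic (an assumption), and the manual loop is an illustration rather than scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for X and y
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Manual 5-fold loop: fit on the training folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The same thing in one call
cv_scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print(f'Manual loop:     {np.round(manual_scores, 3)}')
print(f'cross_val_score: {np.round(cv_scores_demo, 3)}')
```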
Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.
In the IPython Shell, you can use %timeit
to see how long 3-fold CV takes compared to 10-fold CV by executing the following with cv=3
and cv=10
:
%timeit cross_val_score(reg, X, y, cv = ____)
pandas
and numpy
are available in the workspace as pd
and np
. The DataFrame has been loaded as df
and the feature/target variable arrays X
and y
have been created.
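Instructions
Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
Create a linear regression regressor called reg.
Perform 3-fold CV and then 10-fold CV, and compare the resulting mean scores.
# Create a linear regression object: reg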
reg = LinearRegression()
# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(f'cv=3 scores mean: {np.round(np.mean(cvscores_3), 3)}')
# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(f'cv=10 scores mean: {np.round(np.mean(cvscores_10), 3)}')
cv3 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=3)
cv10 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=10)
print(f'cv=3 time: {cv3}\ncv=10 time: {cv10}')
Why regularize?
Ridge regression
k
in KNN
.Ridge regression in scikit-learn
sklearn.linear_model.Ridge
alpha
parameter.normalize
parameter to True
, ensures all the variables are on the same scale, which will be covered later in more depth.Lasso regression
Lasso regression in scikit-learn
sklearn.linear_model.Lasso
Lasso regression for feature selection
LASSO
algorithm.housing price
, is number of rooms, 'RM'
.
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)
# instantiate the model
ridge = Ridge(alpha=0.1, normalize=True)
# fit the model
ridge.fit(X_train, y_train)
# predict on the test data
ridge_pred = ridge.predict(X_test)
# get the score
rs = ridge.score(X_test, y_test)
print(f'Ridge Score: {round(rs, 4)}')
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)
# instantiate the regressor
lasso = Lasso(alpha=0.1, normalize=True)
# fit the model
lasso.fit(X_train, y_train)
# predict on the test data
lasso_pred = lasso.predict(X_test)
# get the score
ls = lasso.score(X_test, y_test)
print(f'Lasso Score: {round(ls, 4)}')
# store the feature names
names = boston.drop('MEDV', axis=1).columns
# instantiate the regressor
lasso = Lasso(alpha=0.1)
# extract and store the coef attribute
lasso_coef = lasso.fit(boston.drop('MEDV', axis=1), boston.MEDV).coef_
plt.plot(range(len(names)), lasso_coef)
plt.xticks(range(len(names)), names, rotation=60)
plt.ylabel('Coefficients')
plt.grid()
plt.show()
In the video, you saw how Lasso selected out the 'RM'
feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.
In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.
The feature and target variable arrays have been pre-loaded as X
and y
.
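Instructions
Import Lasso from sklearn.linear_model.
Instantiate a Lasso regressor with an alpha of 0.4 and specify normalize=True.
Fit the regressor to the data and compute the coefficients using the coef_ attribute.
Plot the coefficients on the y-axis and the column names on the x-axis.
# Instantiate a lasso regressor: lasso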
lasso = Lasso(alpha=0.4, normalize=True)
# Fit the regressor to the data
lasso.fit(X, y)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(f'Lasso Coef: {lasso_coef}\n')
# Plot the coefficients
df_columns = df.drop(['life', 'Region'], axis=1).columns
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()
According to the lasso algorithm, it seems like 'child_mortality'
is the most important feature when predicting life expectancy.
Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.
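The difference between the two regularizers is easy to see on data where only a few features matter. The sketch below is an illustration on synthetic data from make_regression (an assumption, with 3 informative features out of 10), not on the Boston or Gapminder data; the alpha values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Lasso sets some coefficients exactly to zero; ridge only shrinks them
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print(f'Lasso coefficients: {np.round(lasso_demo.coef_, 2)}')
print(f'Ridge coefficients: {np.round(ridge_demo.coef_, 2)}')
print(f'Exact zeros -> lasso: {n_zero_lasso}, ridge: {n_zero_ridge}')
```

The exact zeros are what make lasso usable for feature selection, whereas ridge keeps every feature in the model with a smaller weight.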
Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as $L1$ regularization because the regularization term is the $L1$ norm of the coefficients. This is not the only way to regularize, however.
If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be computing the $L2$ norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated $R^2$ scores for each, using this function that we have defined for you, which plots the $R^2$ score as well as standard error for each alpha:
Don't worry about the specifics of how this function works. The motivation behind this exercise is for you to see how the $R^2$ score varies with different alphas, and to understand the importance of selecting the right value for alpha. You'll learn how to tune alpha in the next chapter.
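For reference, the two penalized loss functions described above can be written in their common textbook form, where $a_j$ are the model coefficients and $\alpha$ is the regularization strength. This is a sketch up to scaling constants; scikit-learn scales the data-fit term slightly differently.

$$\text{Lasso } (L1): \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |a_j|$$

$$\text{Ridge } (L2): \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} a_j^2$$

With $\alpha = 0$ both reduce to ordinary least squares, and a larger $\alpha$ penalizes large coefficients more heavily.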
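Instructions
Instantiate a Ridge regressor and specify normalize=True.
Inside the for loop:
Specify the alpha value for the regressor to use.
Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
Append the average and the standard deviation of the computed cross-validated scores. NumPy has been pre-imported for you as np.
Use the display_plot()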
function to visualize the scores and standard deviations.
def display_plot(cv_scores, cv_scores_std, alpha_space):
    fig = plt.figure(figsize=(9, 6))
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores, label='CV Scores')
    # standard error of the mean over the 10 folds
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, color='violet', alpha=0.2, label='CV Score ± std error')
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5', label='Max CV Score')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)
# Compute scores over range of alphas
for alpha in alpha_space:
    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))
# Display the plot
display_plot(ridge_scores, ridge_scores_std, alpha_space)
Notice how the cross-validation scores change with different alphas. Which alpha should you pick? How can you fine-tune your model? You'll learn all about this in the next chapter!
Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about some of the other metrics available in scikit-learn that will allow you to assess your model's performance in a more nuanced manner. Next, learn to optimize your classification and regression models using hyperparameter tuning.
Class imbalance example: Emails
Diagnosing classification predictions
sklearn.metrics.confusion_matrix
Metrics from the confusion matrix
# using the voting dataset from 1.3.1
v_na.head()
# instantiate the model
knn = KNeighborsClassifier(n_neighbors=8)
# split the voting data
X_train, X_test, y_train, y_test = train_test_split(v_na.drop(['party'], axis=1), v_na.party, test_size=0.4, random_state=42)
# fit the training data
knn.fit(X_train, y_train)
# predict the labels of the test set
y_pred = knn.predict(X_test)
# confusion_matrix
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')
# classification report
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, as Andy discussed, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.
You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.
Here, you'll work with the PIMA Indians diabetes dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0
indicates that the patient does not have diabetes, while a value of 1
indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.
The dataset has been loaded into a DataFrame df
and the feature and target variable arrays X
and y
have been created for you. In addition, sklearn.model_selection.train_test_split
and sklearn.neighbors.KNeighborsClassifier
have already been imported.
Your job is to train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.
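Instructions
Import classification_report and confusion_matrix from sklearn.metrics.
Create training and testing sets with 40% of the data used for testing. Use a random state of 42.
Instantiate a k-NN classifier with 6 neighbors, fit it to the training data, and predict the labels of the test set.
Compute and print the confusion matrix and classification report using the confusion_matrix() and classification_report() functions.
df = pd.read_csv(data_paths[2])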
X = df.drop('diabetes', axis=1)
y = df.diabetes
df.head()
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
# Generate the confusion matrix and classification report
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')
By analyzing the confusion matrix and classification report, you can get a much better understanding of your classifier's performance.
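The numbers in the classification report come straight from the confusion matrix. As a small worked example, the sketch below uses ten hand-made labels (hypothetical values, not the diabetes data) and derives precision, recall, and F1 for the positive class from the four confusion-matrix cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hand-made labels for illustration: 4 actual positives, 6 actual negatives
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
print(f'precision={precision:0.2f}, recall={recall:0.2f}, f1={f1:0.2f}')
```

The same values are returned by precision_score(), recall_score(), and f1_score() from sklearn.metrics.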
Logistic regression for binary classification
'1'
, for less than 0.5, we label it '0'
.Logistic regression in scikit-learn
'1'
for all the data, which means the true positive rate is equal to the false positive rate, is equal to one.'0'
for all the data, which means that both true and false positive rates are 0.'1'
to the observation in question..predict_proba
to the model and pass it the test data.'1'
.sklearn.metrics.roc_curve
# instantiate the model
logreg = LogisticRegression()
# split the voting data
X_train, X_test, y_train, y_test = train_test_split(v_na.drop(['party'], axis=1), v_na.party, test_size=0.4, random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob, pos_label='republican')
plt.figure(figsize=(6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!
The feature and target variable arrays X
and y
have been pre-loaded, and train_test_split
has been imported for you from sklearn.model_selection
.
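Instructions
Import LogisticRegression from sklearn.linear_model, and confusion_matrix and classification_report from sklearn.metrics.
Create training and test sets with 40% (or 0.4) of the data used for testing. Use a random state of 42. This has been done for you.
Instantiate a LogisticRegression classifier called logreg.
Fit the classifier to the training data and predict the labels of the test set.
Compute and print the confusion matrix and classification report.
# Create training and test sets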
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
# Create the classifier: logreg
logreg = LogisticRegression(max_iter=150)
# Fit the classifier to the training data
logreg.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)
# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
You now know how to use logistic regression for binary classification - great work! Logistic regression is used in a variety of machine learning applications and will become a vital part of your data science toolbox.
Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. As Hugo demonstrated in the video, most classifiers in scikit-learn have a .predict_proba()
method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the .predict_proba()
method and become familiar with its functionality.
Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg
.
Instructions
Import roc_curve with from sklearn.metrics import roc_curve.
Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.
Use the roc_curve() function with y_test and y_pred_prob and unpack the result into the variables fpr, tpr, and thresholds.
Plot the ROC curve with fpr on the x-axis and tpr on the y-axis.
# Create the classifier: logreg
logreg = LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear')
# Fit the classifier to the training data
logreg.fit(X_train, y_train)
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.subplots(figsize=(4.5, 4.5))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
This ROC curve provides a nice visual way to assess your classifier's performance.
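Each point on the ROC curve corresponds to one decision threshold applied to the predicted probabilities. The sketch below computes the true and false positive rates by hand at three thresholds; it assumes synthetic data from make_classification, and rates_at() is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def rates_at(threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    pred = (probs >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_te == 1))
    fp = np.sum((pred == 1) & (y_te == 0))
    return tp / np.sum(y_te == 1), fp / np.sum(y_te == 0)

for t in [0.0, 0.5, 1.01]:
    tpr_t, fpr_t = rates_at(t)
    print(f'threshold={t:0.2f}: TPR={tpr_t:0.2f}, FPR={fpr_t:0.2f}')
```

A threshold of 0 labels every sample '1', so both rates are 1; a threshold above 1 labels every sample '0', so both rates are 0. The thresholds in between trace out the curve.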
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:
$Precision=\frac{t_{p}}{t_{p}+f_{p}}$
$Recall=\frac{t_{p}}{t_{p}+f_{n}}$
On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.
Study the precision-recall curve and then consider the statements given below. Choose the one statement that is not true. Note that here, the class is positive (1) if the individual has diabetes.
lr_precision, lr_recall, _ = precision_recall_curve(y_test, y_pred_prob)
no_skill = len(y_test[y_test==1]) / len(y_test)
plt.subplots(figsize=(5, 5))
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
Instructions
scikit-learn
from sklearn.metrics import roc_auc_score
v_na.head(3)
# instantiate the classifier
logreg = LogisticRegression()
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(v_na.iloc[:, 1:], v_na.iloc[:, 0], test_size=0.4, random_state=42)
# fit the model to the train data
logreg.fit(X_train, y_train)
# compute the predicted probabilities
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# pass the True labels and the predicted probabilities to roc_auc_score
roc_auc_score(y_test, y_pred_prob)
sklearn.model_selection.cross_val_score
from sklearn.model_selection import cross_val_score
# pass the estimator, features, and target, to cross_val_score, and use scoring='roc_auc'
cv_scores = cross_val_score(logreg, v_na.iloc[:, 1:], v_na.iloc[:, 0], cv=5, scoring='roc_auc')
cv_scores
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
In this exercise, you'll calculate AUC scores using the roc_auc_score()
function from sklearn.metrics
as well as by performing cross-validation on the diabetes dataset.
X and y, along with training and test sets X_train
, X_test
, y_train
, y_test
, have been pre-loaded for you, and a logistic regression classifier logreg
has been fit to the training data.
Instructions
Import roc_auc_score from sklearn.metrics and cross_val_score from sklearn.model_selection.
Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.
Compute the AUC score using the roc_auc_score() function, the test set labels y_test, and the predicted probabilities y_pred_prob.
Compute the AUC scores by performing 5-fold cross-validation. Use the cross_val_score() function and specify the scoring parameter to be 'roc_auc'.
# instantiate the classifier
logreg = LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear')
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# fit the model to the train data
logreg.fit(X_train, y_train)
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
# Compute and print AUC score
print(f"AUC: {roc_auc_score(y_test, y_pred_prob)}")
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
# Print list of AUC scores
print(f"AUC scores computed using 5-fold cross-validation: {cv_auc}")
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.subplots(figsize=(4.5, 4.5))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
You now have a number of different methods you can use to evaluate your model's performance.
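The claim that random guessing gives an AUC of about 0.5 is easy to check. This is a minimal sketch with randomly generated labels and scores (both are assumptions made for the illustration, not data from the course).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Random binary labels and scores that carry no information about them
y_true_demo = rng.integers(0, 2, size=10_000)
random_scores = rng.random(10_000)

auc_random = roc_auc_score(y_true_demo, random_scores)
print(f'AUC of random scores: {auc_random:0.3f}')
```

Any model whose AUC is not clearly above this baseline has learned little that separates the two classes.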
alpha
in ridge and lasso regression before fitting it. n_neighbors
must be chosen.train-test-split
alone, would risk overfitting the hyperparameter to the test set.C
and alpha
, the grid values to test could look like the following:scikit-learn
we implement it using the class GridSearchCV
from sklearn.model_selection import GridSearchCV
dictionary
in which the keys
are the hyperparameter names, such as n_neighbors
in KNN or alpha
in lasso regression.values
in the grid dictionary
are lists
containing the values we wish to tune the relevant hyperparameter(s) over.GridSearchCV
returns a GridSearchCV
object, which is fit to the data, and this fit performs the grid search inplace..best_params_
to retrieve the parameters that perform the best.best_score_
returns the mean cross-validation score over that fold.
# specify the hyperparameter(s) as a dictionary
param_grid = {'n_neighbors': np.arange(1, 50)}
# instantiate the classifier
knn = KNeighborsClassifier()
# use GridSearchCV and pass in the model, the grid to tune over, and the number of folds
knn_cv = GridSearchCV(knn, param_grid, cv=5)
# fit and perform the grid search
knn_cv.fit(v_na.iloc[:, 1:], v_na.iloc[:, 0])
bp = knn_cv.best_params_
bs = round(knn_cv.best_score_, 2)
print(f'Best Parameters: {bp}\nBest Score: {bs}')
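Conceptually, a grid search is a loop: cross-validate the model once per candidate value and keep the value with the best mean score. The sketch below spells that loop out and compares it with GridSearchCV. It assumes the iris data bundled with scikit-learn rather than the voting dataset, and a short hypothetical candidate list.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
candidates = [1, 3, 5, 7, 9]

# Manual grid search: mean 5-fold CV accuracy for each candidate n_neighbors
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in candidates
]
best_manual_score = max(mean_scores)
best_k = candidates[int(np.argmax(mean_scores))]

# The same search with GridSearchCV
grid_demo = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': candidates}, cv=5)
grid_demo.fit(X_demo, y_demo)

print(f'Manual loop:  k={best_k}, score={best_manual_score:0.3f}')
print(f'GridSearchCV: {grid_demo.best_params_}, score={grid_demo.best_score_:0.3f}')
```

By default GridSearchCV also refits the best model on all of the data passed to .fit(), which the manual loop does not do.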
Hugo demonstrated how to tune the n_neighbors
parameter of the KNeighborsClassifier()
using GridSearchCV
on the voting dataset. You will now practice this yourself, but by using logistic regression on the diabetes dataset instead!
Like the alpha
parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter:C
. C
controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C
can lead to an overfit model, while a small C
can lead to an underfit model.
The hyperparameter space for C has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X
and target variable array is available as y
.
You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!
Instructions
Import LogisticRegression from sklearn.linear_model and GridSearchCV from sklearn.model_selection.
Set up the hyperparameter grid by using c_space as the grid of values to tune C over.
Instantiate a logistic regression classifier called logreg.
Use GridSearchCV with 5-fold cross-validation to tune C:
Inside GridSearchCV(), specify the classifier, parameter grid, and number of folds to use.
Use the .fit() method on the GridSearchCV object to fit it to the data X and y.
Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'max_iter': range(200, 1000, 200)}
# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# Fit it to the data
logreg_cv.fit(X,y)
bp = logreg_cv.best_params_
bs = round(logreg_cv.best_score_, 2)
# Print the tuned parameters and score
print(f"Tuned Logistic Regression Parameters: {bp}")
print(f"Best score is {bs}")
It looks like a C
of 0
results in the best performance.
GridSearchCV
can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV
, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV
in this exercise and see how this works.
Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit()
and .predict()
methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features
, max_depth
, and min_samples_leaf
: This makes it an ideal use case for RandomizedSearchCV
.
As before, the feature array X
and target variable array y
of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV
to find the optimal hyperparameters. Go for it!
Instructions
Import DecisionTreeClassifier from sklearn.tree and RandomizedSearchCV from sklearn.model_selection.
Instantiate a DecisionTreeClassifier.
Use RandomizedSearchCV with 5-fold cross-validation to tune the hyperparameters:
Inside RandomizedSearchCV(), specify the classifier, parameter distribution, and number of folds to use.
Use the .fit() method on the RandomizedSearchCV object to fit it to the data X and y.
Print the best parameter and best score obtained from RandomizedSearchCV by accessing the best_params_ and best_score_ attributes of tree_cv.
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()
# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(estimator=tree, param_distributions=param_dist, cv=5)
# Fit it to the data
tree_cv.fit(X, y)
# Print the tuned parameters and score
bp = tree_cv.best_params_
bs = round(tree_cv.best_score_, 2)
print(f"Tuned Decision Tree Parameters: {bp}")
print(f"Best score is {bs}")
You'll see a lot more of decision trees and RandomizedSearchCV
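as you continue your machine learning journey. Note that when both search the same finite set of candidate values, RandomizedSearchCV
cannot outperform GridSearchCV
, because it evaluates only a sample of the combinations that the grid search tries exhaustively. Its value is that it saves computation time.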
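The computational saving comes from the n_iter parameter, which fixes how many hyperparameter settings are sampled regardless of how large the search space is. The sketch below assumes the iris data bundled with scikit-learn (so max_features is capped at 4) and fixes random_state so the sampled settings are repeatable.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs() method
param_dist_demo = {
    'max_depth': [3, None],
    'max_features': randint(1, 5),       # integers 1..4 (iris has 4 features)
    'min_samples_leaf': randint(1, 9),   # integers 1..8
}

search_demo = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist_demo,
    n_iter=10,          # only 10 settings are tried
    cv=5,
    random_state=0,
)
search_demo.fit(X_demo, y_demo)

n_tried = len(search_demo.cv_results_['params'])
print(f'Settings evaluated: {n_tried}')
print(f'Best parameters: {search_demo.best_params_}')
```

An exhaustive grid over the same ranges would need 2 × 4 × 8 = 64 settings, each cross-validated five times.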
train-test-split
and GridSearchCV
will be useful.
For which of the following reasons would you want to use a hold-out set for the very end?
Possible Answers
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X
and y
.
In addition to C
, logistic regression has a 'penalty'
hyperparameter which specifies whether to use 'l1'
or 'l2'
regularization. Your job in this exercise is to create a hold-out set, tune the 'C'
and 'penalty'
hyperparameters of a logistic regression classifier using GridSearchCV
on the training set.
Instructions
Create the hyperparameter grid:
Use the array c_space as the grid of values for 'C'.
For 'penalty', specify a list consisting of 'l1' and 'l2'.
Instantiate a logistic regression classifier.
Create training and test sets. Use a test_size of 0.4 and random_state of 42. In practice, the test set here will function as the hold-out set.
Tune the hyperparameters on the training set using GridSearchCV with 5-folds. This involves first instantiating the GridSearchCV object with the correct parameters and then fitting it to the training data.
Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.
# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}
# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression(solver='liblinear', multi_class='ovr', n_jobs=1)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
# Fit it to the training data
logreg_cv.fit(X_train, y_train)
# Print the tuned parameters and score
bp = logreg_cv.best_params_
bs = round(logreg_cv.best_score_, 2)
# Print the optimal parameters and best score
print(f"Tuned Logistic Regression Parameter: {bp}")
print(f"Tuned Logistic Regression Accuracy: {bs}")
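Note that best_score_ is the mean cross-validation score on the training data, not the hold-out score. To actually use the hold-out set, score the fitted search object on it. The sketch below shows one way to do this on synthetic data from make_classification (an assumption; the grid is also shortened for speed).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the diabetes features and labels
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

# liblinear supports both the 'l1' and 'l2' penalties
grid_demo = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']},
    cv=5,
)
grid_demo.fit(X_tr, y_tr)   # tuning sees only the training portion

# .score() uses the refit best estimator, here on data it has never seen
holdout_acc = grid_demo.score(X_ho, y_ho)
print(f'Best CV accuracy (training data): {grid_demo.best_score_:0.3f}')
print(f'Hold-out accuracy:                {holdout_acc:0.3f}')
```

The hold-out accuracy is the more honest estimate of performance on new data, because the hyperparameters were never chosen with those rows in view.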
Remember lasso and ridge regression from the previous chapter? Lasso used the $L1$ penalty to regularize, while ridge used the $L2$ penalty. There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the $L1$ and $L2$ penalties:
$a \cdot L1 + b \cdot L2$
In scikit-learn, this term is represented by the 'l1_ratio'
parameter: An 'l1_ratio'
of 1
corresponds to an $L1$ penalty, and anything lower is a combination of $L1$ and $L2$.
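To see how `l1_ratio` splits the penalty, here is a small sketch of the penalty term as it appears in scikit-learn's documented `ElasticNet` objective, where the weights are `alpha * l1_ratio` on the $L1$ norm and `0.5 * alpha * (1 - l1_ratio)` on the squared $L2$ norm. The helper name `elastic_net_penalty` and the coefficient vector are illustrative assumptions.

```python
import numpy as np

def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term as written in the scikit-learn ElasticNet objective."""
    l1 = np.sum(np.abs(coef))   # L1 norm of the coefficients
    l2_sq = np.sum(coef ** 2)   # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_sq

w = np.array([3.0, -4.0])
pure_l1 = elastic_net_penalty(w, l1_ratio=1.0)   # only the L1 part remains
pure_l2 = elastic_net_penalty(w, l1_ratio=0.0)   # only the L2 part remains
mixed = elastic_net_penalty(w, l1_ratio=0.5)     # half of each
print(pure_l1, pure_l2, mixed)
```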
In this exercise, you will use `GridSearchCV` to tune the `'l1_ratio'` of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.
Instructions
- Import `ElasticNet` from `sklearn.linear_model`, `mean_squared_error` from `sklearn.metrics`, and `GridSearchCV` and `train_test_split` from `sklearn.model_selection`.
- Create training and test sets, with 40% of the data used for the test set. Use a random state of `42`.
- Specify the hyperparameter grid for `'l1_ratio'` using `l1_space` as the grid of values to search over.
- Instantiate the `ElasticNet` regressor.
- Use `GridSearchCV` with 5-fold cross-validation to tune `'l1_ratio'` on the training data `X_train` and `y_train`. This involves first instantiating the `GridSearchCV` object with the correct parameters and then fitting it to the training data.
fertility = pd.read_csv(data_paths[3])
y_life = fertility.pop('life')
X_fer = fertility.iloc[:, :-1]
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_fer, y_life, test_size=0.4, random_state=42)
# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}
# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()
# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(estimator=elastic_net, param_grid=param_grid, cv=5)
# Fit it to the training data
gm_cv.fit(X_train, y_train)
# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Tuned ElasticNet l1 ratio: {gm_cv.best_params_}")
print(f"Tuned ElasticNet R squared: {r2}")
print(f"Tuned ElasticNet MSE: {mse}")
Now that you understand how to fine-tune your models, it's time to learn about preprocessing techniques and how to piece together all the different stages of the machine learning process into a pipeline!
This chapter introduces pipelines, and how scikit-learn allows for transformers and estimators to be chained together and used as a single unit. Preprocessing techniques will be introduced as a way to enhance model performance, and pipelines will tie together concepts from previous chapters.
Dealing with Categorical Features
- `0` means the observation was not that category, while `1` means it was.
Dummy Variables
- An `'origin'` feature with three different possible values: `'US'`, `'Asia'`, and `'Europe'`.
- Each observation has a `1` in exactly one of the three columns and `0` in the other two.
- If a car is not from `'US'` and not from `'Asia'`, then implicitly, it is from `'Europe'`.
- This makes the `'Europe'` column redundant.
- scikit-learn: `OneHotEncoder()`
- pandas: `get_dummies()`
Automobile Dataset
- `'origin'` is our one categorical feature
- `pd.get_dummies` creates three new binary features.
- Use `drop_first=True`, or `milage.drop('origin_Asia', axis=1)`, to remove the first dummy column
milage = pd.read_csv(datasets[0])
milage.head(3)
origin = sorted(milage.origin.unique().tolist())
origin
sns.boxplot(data=milage, y='mpg', x='origin', order=origin)
plt.show()
milage = pd.get_dummies(milage, drop_first=True)
milage.head()
# fit a ridge regression model and compute its R^2
X_train, X_test, y_train, y_test = train_test_split(milage.iloc[:, 1:], milage.mpg, test_size=0.3, random_state=42)
ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)
ridge.score(X_test, y_test)
ridge.coef_
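To make the dummy-variable encoding concrete, here is a minimal sketch on a made-up four-row frame (the `cars` data is an assumption, not the automobile dataset). It shows which columns `pd.get_dummies` creates and which one `drop_first=True` removes.

```python
import pandas as pd

# Toy stand-in for the automobile data (values are made up for illustration)
cars = pd.DataFrame({'mpg': [18.0, 31.5, 24.0, 27.2],
                     'origin': ['US', 'Asia', 'Europe', 'Asia']})

# One binary column per category
all_dummies = pd.get_dummies(cars)
# drop_first=True drops the first category in sorted order ('Asia')
reduced = pd.get_dummies(cars, drop_first=True)

print(list(all_dummies.columns))
print(list(reduced.columns))
```

A row with `0` in both remaining columns is implicitly from the dropped category, so no information is lost.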
The Gapminder dataset that you worked with in previous chapters also contained a categorical `'Region'` feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!
Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.
Instructions
- Import `pandas` as `pd`.
- Read the CSV file `'gapminder.csv'` into a DataFrame called `df`.
- Create a boxplot of life expectancy (`'life'`) by region (`'Region'`). To do so, pass the column names in to `df.boxplot()` (in that order).
fertility = pd.read_csv(datasets[3])
fertility.head(2)
plt.style.use('ggplot')
fertility.boxplot(column='life', by='Region', figsize=(9, 5), rot=60)
Exploratory data analysis should always be the precursor to model building.
As Andy discussed in the video, scikit-learn does not accept non-numerical features. You saw in the previous exercise that the `'Region'` feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the `'Region'` feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.
Instructions
- Use the pandas `get_dummies()` function to create dummy variables from the df DataFrame. Store the result as `df_region`.
- Print the columns of `df_region`. This has been done for you.
- Use the `get_dummies()` function again, this time specifying `drop_first=True` to drop the unneeded dummy variable (in this case, `'Region_America'`).
- Print the new columns of `df_region` and take note of how one column was dropped!
# Create dummy variables: df_region
df_region = pd.get_dummies(fertility)
# Print the columns of df_region
display(df_region.head(2))
# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(fertility, drop_first=True)
# Print the new columns of df_region
display(df_region.head(2))
Now that you have created the dummy variables, you can use the `'Region'` feature to predict life expectancy!
Having created the dummy variables from the `'Region'` feature, you can build regression models as you did before. Here, you'll use ridge regression to perform 5-fold cross-validation.
The feature array `X` and target variable array `y` have been pre-loaded.
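Instructions
- Import `Ridge` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
- Instantiate a ridge regressor called `ridge` with `alpha=0.5` and `normalize=True`.
- Perform 5-fold cross-validation on `X` and `y` using the `cross_val_score()` function.
life = df_region.pop('life')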
# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)
# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(estimator=ridge, X=df_region, y=life, cv=5)
# Print the cross-validated scores
ridge_cv
You now know how to build models using data that includes categorical features.
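The `normalize=True` argument used above works in the scikit-learn version this notebook targets, but the parameter was deprecated in scikit-learn 1.0 and removed in 1.2. One common substitute, sketched below on synthetic data (an assumption, standing in for the Gapminder features), is to put a `StandardScaler` in front of `Ridge` in a pipeline. This is not numerically identical to `normalize=True`, which divided by the l2-norm instead of the standard deviation, so the best `alpha` may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: a stand-in for the Gapminder features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * [1, 10, 100, 1000]
y_demo = X_demo @ [2.0, 0.3, 0.01, 0.002] + rng.normal(size=100)

# Scaling happens inside each CV fold because it is part of the pipeline
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
demo_scores = cross_val_score(ridge_pipe, X_demo, y_demo, cv=5)
print(demo_scores)
```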
PIMA Indians Diabetes Dataset
- According to `df.info()`, all features have 768 non-null entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pregnancies 768 non-null int64
1 glucose 768 non-null int64
2 diastolic 768 non-null int64
3 triceps 768 non-null int64
4 insulin 768 non-null int64
5 bmi 768 non-null float64
6 dpf 768 non-null float64
7 age 768 non-null int64
8 diabetes 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
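- Missing values can be encoded in different ways, such as `0`, `'?'`, or `-1`.
- From `df.head()` it looks as though there are observations where insulin is `0`.
- `'triceps'`, which is the thickness of the skin on the back of the arm, is also `0` in some observations.
Dropping Missing Data
- Convert these values to `NaN` using the replace method on the relevant columns.
- Rows with missing values can then be removed with `df.dropna()`.
- For the `pima` dataset, using `pima.dropna()` results in losing nearly half of the data. The shape changes from `(768, 9)` to `(393, 9)`.
- A less wasteful way of handling the `NaN` values is needed.
Imputing Missing Data
- A more elaborate approach uses `RandomForestRegressor` and `pandas.DataFrame.groupby` to impute missing data in the `'age'` column of the Titanic dataset.
- Use `from sklearn.impute import SimpleImputer`, not `from sklearn.preprocessing import Imputer`.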
pima = pd.read_csv(datasets[2])
pima.head()
# distribution with missing data encoded as 0; note triceps and insulin
pima_stacked = pima.iloc[:, :-1].stack().reset_index().drop('level_0', axis=1).rename({'level_1': 'Category', 0: 'Value'}, axis=1)
sns.catplot(data=pima_stacked, kind='box', col='Category', col_wrap=4, y='Value', sharey=False, height=3, color='lightblue')
# replace erroneously encoded data with NaN
pima.insulin.replace(0, np.nan, inplace=True)
pima.triceps.replace(0, np.nan, inplace=True)
pima.bmi.replace(0, np.nan, inplace=True)
pima.info()
# distribution with missing data encoded as NaN
pima_stacked = pima.iloc[:, :-1].stack().reset_index().drop('level_0', axis=1).rename({'level_1': 'Category', 0: 'Value'}, axis=1)
sns.catplot(data=pima_stacked, kind='box', col='Category', col_wrap=4, y='Value', sharey=False, height=3, color='lightblue')
# instantiate the imputer; mean is the default and imputation is along the column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit the imputer
imp_mean.fit(pima.iloc[:, :-1])
# update the pima values
pima.iloc[:, :-1] = imp_mean.transform(pima.iloc[:, :-1])
pima.describe()
# distribution with missing data imputed as mean
pima_stacked = pima.iloc[:, :-1].stack().reset_index().drop('level_0', axis=1).rename({'level_1': 'Category', 0: 'Value'}, axis=1)
sns.catplot(data=pima_stacked, kind='box', col='Category', col_wrap=4, y='Value', sharey=False, height=3, color='lightblue')
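For a version of the same two steps that can be checked by hand, here is a minimal sketch on a made-up 3x2 array (the values are assumptions, not rows from the PIMA data): placeholder zeros are re-encoded as `NaN`, then each `NaN` is replaced with its column mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with 0 standing in for "not recorded" (made-up values)
raw = np.array([[1.0, 0.0],
                [3.0, 4.0],
                [0.0, 8.0]])

# Re-encode the placeholder zeros as NaN, then impute column means
with_nan = np.where(raw == 0, np.nan, raw)
imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(with_nan)
print(imputed)
```

The first column's missing entry becomes 2.0 (the mean of 1 and 3) and the second column's becomes 6.0 (the mean of 4 and 8).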
Imputing Within a Pipeline
from sklearn.pipeline import Pipeline
pima = pd.read_csv(datasets[2])
pima.insulin.replace(0, np.nan, inplace=True)
pima.triceps.replace(0, np.nan, inplace=True)
pima.bmi.replace(0, np.nan, inplace=True)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# instantiate the model
logreg = LogisticRegression()
# build the pipeline object by creating a list of steps in the pipeline
steps = [('imputation', imp_mean), ('logistic_regression', logreg)]
# and then pass the list to the Pipeline constructor
pipeline = Pipeline(steps)
# split the data
X_train, X_test, y_train, y_test = train_test_split(pima.iloc[:, :-1], pima.iloc[:, -1], test_size=0.3, random_state=42)
# fit the pipeline to the training set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
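# for a classifier, .score() returns the mean accuracy (not R^2)
accuracy = round(pipeline.score(X_test, y_test), 2)
# with 0/1 labels, the mean squared error equals the misclassification rate
mse = round(mean_squared_error(y_test, y_pred), 2)
print(f"Accuracy: {accuracy}")
print(f"Mean Square Error: {mse}")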
The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it's time for you to take care of these yourself!
The unprocessed dataset has been loaded into a DataFrame `df`. Explore it in the IPython Shell with the `.head()` method. You will see that there are certain data points labeled with a `'?'`. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a `'9999'`, other times a `0` - real-world data can be very messy! If you're lucky, the missing values will already be encoded as `NaN`. We use `NaN` because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as `.dropna()` and `.fillna()`, as well as scikit-learn's Imputation transformer `Imputer()`.
In this exercise, your job is to convert the `'?'`s to NaNs, and then drop the rows that contain them from the DataFrame.
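The convert-count-drop sequence can be sketched on a tiny made-up frame first (the `votes` data below is an assumption, not the voting dataset), which makes it easy to see how quickly `.dropna()` discards rows.

```python
import numpy as np
import pandas as pd

# Toy voting-style frame with '?' marking missing votes (made-up values)
votes = pd.DataFrame({'party': ['democrat', 'republican', 'democrat'],
                      'budget': ['y', '?', 'n'],
                      'crime': ['n', 'y', '?']})

votes_nan = votes.replace('?', np.nan)   # '?' becomes NaN everywhere
nan_counts = votes_nan.isnull().sum()    # NaNs per column
complete = votes_nan.dropna()            # keep only fully observed rows

print(nan_counts.to_dict())
print(votes.shape, complete.shape)
```

Only one of the three rows survives, even though each column is missing just one value.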
Instructions
- Explore `df` in the IPython Shell. Notice how the missing value is represented.
- Convert all `'?'` data points to `np.nan`.
- Count the total number of NaNs using the `.isnull()` and `.sum()` methods. This has been done for you.
- Drop the rows with missing values from `df` using `.dropna()`.
cols = ['party', 'infants', 'water', 'budget', 'physician', 'salvador', 'religious', 'satellite', 'aid',
'missile', 'immigration', 'synfuels', 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df = pd.read_csv(datasets[4], header=None, names=cols)
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'n': 0, 'y': 1})
df.head(2)
# Convert '?' to NaN
df[df == '?'] = np.nan
# Print the number of NaNs
display(df.isnull().sum().to_frame().rename({0: 'NaN_Count'}, axis=1))
# Print shape of original DataFrame
print(f"Shape of Original DataFrame: {df.shape}")
# Drop missing values and print shape of new DataFrame
df = df.dropna()
# Print shape of new DataFrame
print(f"Shape of DataFrame After Dropping All Rows with Missing Values: {df.shape}")
When many values in your dataset are missing, if you drop them, you may end up throwing away valuable information along with the missing data. It's better instead to develop an imputation strategy. This is where domain knowledge is useful, but in the absence of it, you can impute missing values with the mean or the median of the row or column that the missing value is in.
As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.
You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same `.fit()` and `.predict()` methods as before.
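As a rough illustration of how the two steps behave as a single estimator, the sketch below fits an imputer-plus-`SVC` pipeline on a handful of made-up binary rows (the `X_demo` and `y_demo` arrays are assumptions). It uses `SimpleImputer`, the current replacement for `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy binary features with missing entries (made-up values)
X_demo = np.array([[1, 1], [1, np.nan], [1, 1], [0, 0], [np.nan, 0], [0, 0]], dtype=float)
y_demo = np.array([1, 1, 1, 0, 0, 0])

demo_pipe = Pipeline([('imputation', SimpleImputer(strategy='most_frequent')),
                      ('SVM', SVC())])

# fit() imputes and then trains; predict() re-applies the learned imputation
demo_pipe.fit(X_demo, y_demo)
demo_pred = demo_pipe.predict(np.array([[1, np.nan], [0, 0]], dtype=float))
print(demo_pred)
```

Because the imputer is fitted inside the pipeline, the values it fills in at prediction time come from the training data only.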
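Instructions
- Import `Imputer` from `sklearn.preprocessing` and `SVC` from `sklearn.svm`. SVC stands for Support Vector Classification, which is a type of SVM.
- Note: `Imputer` has been replaced by `SimpleImputer` and there is no `axis` parameter.
- Setup the Imputation transformer to impute missing data (represented as `'NaN'`) with the `'most_frequent'` value in the column (`axis=0`).
- Instantiate a `SVC` classifier. Store the result in `clf`.
- Create the steps of the pipeline, with the imputation step using `imp`.
# Setup the Imputation transformer: imp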
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Instantiate the SVC classifier: clf
clf = SVC()
# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
('SVM', clf)]
Having set up the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the `.fit()` and `.predict()` methods on pipelines just as you did with your classifiers and regressors!
Practice this for yourself now and generate a classification report of your predictions. The steps of the pipeline have been set up for you, and the feature array `X` and target variable array `y` have been pre-loaded. Additionally, `train_test_split` and `classification_report` have been imported from `sklearn.model_selection` and `sklearn.metrics` respectively.
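Instructions
- Import `Imputer` from `sklearn.preprocessing` and `Pipeline` from `sklearn.pipeline`.
- Import `SVC` from `sklearn.svm`.
- Create the pipeline using `Pipeline()` and `steps`.
- Create training and test sets, using a random state of `42`.
cols = ['party', 'infants', 'water', 'budget', 'physician', 'salvador', 'religious', 'satellite', 'aid',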
'missile', 'immigration', 'synfuels', 'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df = pd.read_csv(datasets[4], header=None, names=cols)
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'n': 0, 'y': 1})
df.head(2)
df.eaa_rsa.value_counts()
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values='?', strategy='most_frequent')), ('SVM', SVC())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.party, test_size=0.3, random_state=42)
# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
# Compute metrics
print(classification_report(y_test, y_pred))
Centering and scaling
Why scale your data?
- Use `df.describe()` to check out the ranges of the feature variables in the red wine quality dataset.
- The features include `'acidity'`, `'pH'`, and `'alcohol'` content.
- `'density'` varies from 0.99 to 1 and `'total sulfur dioxide'` from 6 to 289!
Ways to normalize your data
# create the red wine dataset
rw = pd.read_csv(datasets[6], sep=';')
rw.head()
# set the quality column as a binary target, with values < 5 = 1, otherwise 0
rw.quality = np.where(rw.quality < 5, 1, 0)
rw.head(2)
rw.describe()
# stack feature columns
rws = rw.iloc[:, :-1].stack().reset_index(name='Values').drop(['level_0'], axis=1).rename({'level_1': 'Category'}, axis=1)
sns.catplot(data=rws, col='Category', col_wrap=4, y='Values', kind='box', sharey=False, height=3, aspect=1.25, color='lightgreen')
# select X and y - drop rows based on columns used in the video
cols = ['fixed acidity', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
X = rw[cols].to_numpy()
y = rw.quality.astype(bool)
# scale the features
X_scaled = scale(X)
# Print the mean and standard deviation of the unscaled features
print(f"Mean of Unscaled Features: {np.mean(X)}")
print(f"Standard Deviation of Unscaled Features: {np.std(X)}")
# Print the mean and standard deviation of the scaled features
print(f"\nMean of Scaled Features: {np.mean(X_scaled)}")
print(f"Standard Deviation of Scaled Features: {np.std(X_scaled)}")
Import `StandardScaler` from `sklearn.preprocessing` and build a pipeline object as we did earlier; here we'll use a K-nearest neighbors algorithm.
1: This notebook was created on 2021-03-13 with `scikit-learn v0.24.1`, which does not show a difference between the scaled and unscaled data.
# scaling in a pipeline
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)
# unscaled model fitting
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)
CV and scaling in a pipeline
Scaling and CV in a pipeline
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {'knn__n_neighbors': np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_params_)
print(cv.score(X_test, y_test))
print(classification_report(y_test, y_pred))
y_train.value_counts()
y_test.value_counts()
In the video, Hugo demonstrated how significantly the performance of a model can improve if the features are scaled. Note that this is not always the case: In the Congressional voting records dataset, for example, all of the features are binary. In such a situation, scaling will have minimal impact.
You will now explore scaling for yourself on a new dataset - White Wine Quality! Hugo used the Red Wine Quality dataset in the video. We have used the `'quality'` feature of the wine to create a binary target variable: If `'quality'` is less than `5`, the target variable is `1`, and otherwise, it is `0`.
The DataFrame has been pre-loaded as `df`, along with the feature and target variable arrays `X` and `y`. Explore it in the IPython Shell. Notice how some features seem to have different units of measurement. `'density'`, for instance, takes values between 0.98 and 1.04, while `'total sulfur dioxide'` ranges from 9 to 440. As a result, it may be worth scaling the features here. Your job in this exercise is to scale the features and compute the mean and standard deviation of the unscaled features compared to the scaled features.
Instructions
- Import `scale` from `sklearn.preprocessing`.
- Scale the features `X` using `scale()`.
- Print the mean and standard deviation of the unscaled features `X`, and then the scaled features `X_scaled`. Use the numpy functions `np.mean()` and `np.std()` to compute the mean and standard deviations.
ww = pd.read_csv(datasets[5])
ww.quality = np.where(ww.quality < 5, 1, 0)
ww.head()
wws = ww.iloc[:, :-1].stack().reset_index(name='Values').drop(['level_0'], axis=1).rename({'level_1': 'Category'}, axis=1)
sns.catplot(data=wws, col='Category', col_wrap=4, y='Values', kind='box', sharey=False, height=3, aspect=1.25, color='orchid')
X = ww.iloc[:, :-1].to_numpy()
y = ww.quality.astype(bool).to_numpy()
# Scale the features: X_scaled
X_scaled = scale(X)
# Print the mean and standard deviation of the unscaled features
print(f"Mean of Unscaled Features: {np.mean(X)}")
print(f"Standard Deviation of Unscaled Features: {np.std(X)}")
# Print the mean and standard deviation of the scaled features
print(f"\nMean of Scaled Features: {np.mean(X_scaled)}")
print(f"Standard Deviation of Scaled Features: {np.std(X_scaled)}")
Notice the difference in the mean and standard deviation of the scaled features compared to the unscaled features.
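The cells above take `np.mean` and `np.std` over the whole array, which hides what happens to each feature. As a small sketch on made-up numbers (the `X_demo` values are assumptions chosen to mimic `'density'` and `'total sulfur dioxide'`), `scale()` standardises each column separately by subtracting its mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up features on very different ranges
X_demo = np.array([[0.99, 6.0],
                   [1.00, 150.0],
                   [1.01, 289.0]])

# Standardise by hand: subtract each column's mean, divide by its standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
X_demo_scaled = scale(X_demo)

print(X_demo_scaled.mean(axis=0))   # per-column means, approximately 0
print(X_demo_scaled.std(axis=0))    # per-column standard deviations, approximately 1
```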
With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.
The feature array and target variable array have been pre-loaded as `X` and `y`. Additionally, `KNeighborsClassifier` and `train_test_split` have been imported from `sklearn.neighbors` and `sklearn.model_selection`, respectively.
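Instructions
- Import `StandardScaler` from `sklearn.preprocessing` and `Pipeline` from `sklearn.pipeline`.
- Setup the pipeline steps using `StandardScaler()` for `'scaler'` and `KNeighborsClassifier()` for `'knn'`.
- Create the pipeline using `Pipeline()` and `steps`.
- Create training and test sets, with `30%` used for testing. Use a random state of `42`.
- Compute and print the metrics using the `.score()` method inside the provided `print()` functions.
# Setup the pipeline steps: steps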
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)
# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
# Compute and print metrics
print(f'Accuracy with Scaling: {knn_scaled.score(X_test, y_test)}')
print(f'Accuracy without Scaling: {knn_unscaled.score(X_test, y_test)}')
It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.
You'll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are $C$ and $\gamma$. $C$ controls the regularization strength. It is analogous to the $C$ you tuned for logistic regression in Chapter 3, while $\gamma$ controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.
The following modules and functions have been pre-loaded: `Pipeline`, `SVC`, `train_test_split`, `GridSearchCV`, `classification_report`, `accuracy_score`. The feature and target variable arrays `X` and `y` have also been pre-loaded.
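Instructions
- Setup the pipeline with the following steps: `'scaler'` with `StandardScaler()`, and `'SVM'` with `SVC()`.
- Specify the hyperparameter space using the format `'step_name__parameter_name'`. Here, the `step_name` is `SVM`, and the `parameter_name`s are `C` and `gamma`.
- Create training and test sets, using a random state of `21`.
- Instantiate `GridSearchCV` with the pipeline and hyperparameter space and fit it to the training set. The course asks for 3-fold cross-validation, which was the default when it was written; since scikit-learn 0.22 the default is 5-fold, which is what the code below uses.
# Setup the pipeline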
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100], 'SVM__gamma':[0.1, 0.01]}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
# Compute and print metrics
print(f"Accuracy: {cv.score(X_test, y_test)}")
print(classification_report(y_test, y_pred))
print(f"Tuned Model Parameters: {cv.best_params_}")
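The `'SVM__C'` and `'SVM__gamma'` keys in the grid above follow the `'step_name__parameter_name'` convention. A quick way to check which names a pipeline exposes is to inspect `get_params()`, as in this small sketch (the `demo_pipe` and `svm_params` names are illustrative).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipe = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
svm_params = sorted(name for name in demo_pipe.get_params() if name.startswith('SVM__'))
print(svm_params[:3])

# set_params accepts the same names, which is what GridSearchCV relies on
demo_pipe.set_params(SVM__C=10, SVM__gamma=0.01)
print(demo_pipe.named_steps['SVM'].C, demo_pipe.named_steps['SVM'].gamma)
```

A misspelled step or parameter name in the grid raises an error at fit time, so listing the valid names first can save a failed search.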
For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!
Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the `l1_ratio` of your ElasticNet using GridSearchCV.
All the necessary modules have been imported, and the feature and target variable arrays have been pre-loaded as `X` and `y`.
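Instructions
- Setup the pipeline with the following steps:
  - `'imputation'`, which uses the `Imputer()` transformer and the `'mean'` strategy to impute missing data (`'NaN'`) using the mean of the column.
  - `'scaler'`, which scales the features using `StandardScaler()`.
  - `'elasticnet'`, which instantiates an `ElasticNet()` regressor.
- Specify the hyperparameter space using the format `'step_name__parameter_name'`. Here, the `step_name` is `elasticnet`, and the `parameter_name` is `l1_ratio`.
- Create training and test sets, using a random state of `42`.
- Instantiate `GridSearchCV` with the pipeline and hyperparameter space. The course asks for 3-fold cross-validation, which was the default when it was written; since scikit-learn 0.22 the default is 5-fold, which is what the code below uses.
- Fit the `GridSearchCV` object to the training set.
gm = pd.read_csv(datasets[3])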
y = gm.pop('life').to_numpy()
X = gm.iloc[:, :-1].to_numpy()
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('scaler', StandardScaler()),
('elasticnet', ElasticNet())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
gm_cv.fit(X_train, y_train)
# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print(f"Tuned ElasticNet l1 ratio: {gm_cv.best_params_}")
print(f"Tuned ElasticNet R squared: {r2}")
What you’ve learned