The course uses scikit-learn v0.19.2, pandas v0.19.2, and numpy v1.17.4. This notebook uses v0.24.1, v1.2.4, and v1.19.2 respectively, so there are differences in model performance compared to the course.
A function (create_dir_save_file) is used to automatically download and save the required data (data/course_name) and image (Images/course_name) files.
Say you have a collection of customers with a variety of characteristics such as age, location, and financial history, and you wish to discover patterns and sort them into clusters. Or perhaps you have a set of texts, such as Wikipedia pages, and you wish to segment them into categories based on their content. This is the world of unsupervised learning, called as such because you are not guiding, or supervising, the pattern discovery by some prediction task, but instead uncovering hidden structure from unlabeled data. Unsupervised learning encompasses a variety of techniques in machine learning, from clustering to dimension reduction to matrix factorization. In this course, you'll learn the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy. You will learn how to cluster, transform, visualize, and extract insights from unlabeled datasets, and end the course by building a recommender system to recommend popular musical artists.
import pandas as pd
from pprint import pprint as pp
from itertools import combinations
from zipfile import ZipFile
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
from pathlib import Path
import requests
import sys
from scipy.sparse import csr_matrix
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, Normalizer, normalize, MaxAbsScaler
from sklearn.pipeline import make_pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD, NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.simplefilter(action="ignore", category=UserWarning)
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
plt.rcParams["patch.force_edgecolor"] = True
def create_dir_save_file(dir_path: Path, url: str):
    """
    Check if the path exists and create it if it does not.
    Check if the file exists and download it if it does not.
    """
    if not dir_path.parents[0].exists():
        dir_path.parents[0].mkdir(parents=True)
        print(f'Directory Created: {dir_path.parents[0]}')
    else:
        print('Directory Exists')
    if not dir_path.exists():
        r = requests.get(url, allow_redirects=True)
        with open(dir_path, 'wb') as f:
            f.write(r.content)
        print(f'File Created: {dir_path.name}')
    else:
        print('File Exists')
data_dir = Path('data/2021-03-29_unsupervised_learning_python')
images_dir = Path('Images/2021-03-29_unsupervised_learning_python')
# csv files
base = 'https://assets.datacamp.com/production/repositories/655/datasets'
file_spm = base + '/1304e66b1f9799e1a5eac046ef75cf57bb1dd630/company-stock-movements-2010-2015-incl.csv'
file_ev = base + '/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv'
file_fish = base + '/fee715f8cf2e7aad9308462fea5a26b791eb96c4/fish.csv'
file_lcd = base + '/effd1557b8146ab6e620a18d50c9ed82df990dce/lcd-digits.csv'
file_wine = base + '/2b27d4c4bdd65801a3b5c09442be3cb0beb9eae0/wine.csv'
file_artists_sparse = 'https://raw.githubusercontent.com/trenton3983/DataCamp/master/data/2021-03-29_unsupervised_learning_python/artists_sparse.csv'
# zip files
file_grain = base + '/bb87f0bee2ac131042a01307f7d7e3d4a38d21ec/Grains.zip'
file_musicians = base + '/c974f2f2c4834958cbe5d239557fbaf4547dc8a3/Musical%20artists.zip'
file_wiki = base + '/8e2fbb5b8240c06602336f2148f3c42e317d1fdb/Wikipedia%20articles.zip'
file_links = [file_spm, file_ev, file_fish, file_lcd, file_wine, file_grain, file_musicians, file_wiki, file_artists_sparse]
file_paths = list()
for file in file_links:
    file_name = file.split('/')[-1].replace('?raw=true', '').replace('%20', '_')
    data_path = data_dir / file_name
    create_dir_save_file(data_path, file)
    file_paths.append(data_path)
# unzip the zipped files
zip_files = [v for v in file_paths if v.suffix == '.zip']
for file in zip_files:
    with ZipFile(file, 'r') as zip_:
        zip_.extractall(data_dir)
dp = [v for v in data_dir.rglob('*') if v.suffix in ['.csv', '.txt']]
dp
stk: Company Stock Movements 2010 - 2015
stk = pd.read_csv(dp[1], index_col=[0])
stk.iloc[:2, :5]
euv: Eurovision 2016
euv = pd.read_csv(dp[2])
euv.head(2)
fsh: Fish
fsh = pd.read_csv(dp[3], header=None)
fsh.head(2)
lcd: LCD Digits
lcd = pd.read_csv(dp[4], header=None)
lcd.iloc[:2, :5]
win: Wine
win = pd.read_csv(dp[5])
win.head(2)
swl: Seeds Width vs. Length
swl = pd.read_csv(dp[6], header=None)
swl.columns = ['width', 'length']
swl.head(2)
sed: Seeds
sed = pd.read_csv(dp[7], header=None)
sed['varieties'] = sed[7].map({1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'})
sed.head(2)
mus1: Musical Artists
mus1 = pd.read_csv(dp[8])
mus1.head(2)
mus2: Musical Artists - Scrobbler Small Sample
mus2 = pd.read_csv(dp[9])
mus2.head(2)
artists_sparse
artist_df = pd.read_csv(dp[0], header=None, index_col=[0])
artist_names = artist_df.index.tolist()
artists_sparse = csr_matrix(artist_df)
wik1: Wikipedia Vectors
wik1 = pd.read_csv(dp[10], index_col=0).T
wik1.iloc[:4, :10]
wik1_sparse = csr_matrix(wik1)
wik1_sparse
wik2: Wikipedia Vocabulary
wik2 = pd.read_csv(dp[11], header=None)
wik2.head(2)
# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']  # list of variables
# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)[:11]
Learn how to discover the underlying groups (or "clusters") in a dataset. By the end of this chapter, you'll be clustering companies using their stock market prices, and distinguishing different species by clustering their measurements.
Supervised vs unsupervised learning
Iris dataset
Arrays, features & samples
Iris data is 4-dimensional
k-means clustering
kmeans in action on some samples from the iris dataset.
k-means clustering with scikit-learn
Cluster labels for new samples
Scatter plots
iris = sns.load_dataset('iris')
iris_samples = iris.sample(n=75, replace=False, random_state=3)
X_iris = iris_samples.iloc[:, :4]
y_iris = iris_samples.species
iris_samples.head()
iris_model = KMeans(n_clusters=3)
iris_model.fit(X_iris)
iris_labels = iris_model.predict(X_iris)
iris_labels
iris_new_samples = iris[~iris.index.isin(iris_samples.index)].copy()
X_iris_new = iris_new_samples.iloc[:, :4]
y_iris_new = iris_new_samples.species
iris_new_labels = iris_model.predict(X_iris_new)
iris_new_labels
iris_new_samples['pred_labels'] = iris_new_labels
iris_samples['pred_labels'] = iris_labels
pred_labels = pd.concat([iris_new_samples[['species', 'pred_labels']], iris_samples[['species', 'pred_labels']]]).sort_index()
pred_labels.head(2)
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
xs = X_iris.sepal_length
ys = X_iris.petal_length
xs_new = X_iris_new.sepal_length
ys_new = X_iris_new.petal_length
ax1.scatter(xs, ys, c=iris_labels)
ax1.set_ylabel('Petal Length')
ax1.set_xlabel('Sepal Length')
ax1.set_title('Sample')
ax2.scatter(xs_new, ys_new, c=iris_new_labels)
ax2.set_ylabel('Petal Length')
ax2.set_xlabel('Sepal Length')
ax2.set_title('New Sample')
plt.show()
You are given an array points of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.
matplotlib.pyplot has already been imported as plt. In the IPython Shell:
- Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
- Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
- Pass xs and ys to the plt.scatter() function.
- Call the plt.show() function to show your plot.
How many clusters do you see?
Possible Answers
pen = sns.load_dataset('penguins').dropna()
pen
points = pen.iloc[:, 2:4]
points.head()
xs = points.culmen_length_mm
ys = points.culmen_depth_mm
sns.scatterplot(x=xs, y=ys, hue=pen.species)
plt.legend(title='Species', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('Species With Real Labels')
plt.show()
From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict()
method.
You are given the array points
from the previous exercise, and also an array new_points
.
Instructions
- Import KMeans from sklearn.cluster.
- Using KMeans(), create a KMeans instance called model to find 3 clusters. To specify the number of clusters, use the n_clusters keyword argument.
- Use the .fit() method of model to fit the model to the array of points points.
- Use the .predict() method of model to predict the cluster labels of new_points, assigning the result to labels.
- Hit 'Submit Answer' to see the cluster labels of new_points.
# create points
points = pen.iloc[:, 2:4].sample(n=177, random_state=3)
new_points = pen[~pen.index.isin(points.index)].iloc[:, 2:4]
# Import KMeans
# from sklearn.cluster import KMeans
# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)
# Fit model to points
model.fit(points)
labels = model.predict(points)
# Determine the cluster labels of new_points: labels
new_labels = model.predict(new_points)
# Print cluster labels of new_points
print(new_labels)
points.head()
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
xs = points.culmen_length_mm
ys = points.culmen_depth_mm
xs_new = new_points.culmen_length_mm
ys_new = new_points.culmen_depth_mm
ax1.scatter(xs, ys, c=labels)
ax1.set_ylabel('Culmen Depth (mm)')
ax1.set_xlabel('Culmen Length (mm)')
ax1.set_title('Points: Predicted Labels')
ax2.scatter(xs_new, ys_new, c=new_labels)
ax2.set_ylabel('Culmen Depth (mm)')
ax2.set_xlabel('Culmen Length (mm)')
ax2.set_title('New Points: Predicted Labels')
plt.show()
You've successfully performed k-Means clustering and predicted the labels of new points. But it is not easy to inspect the clustering by just looking at the printed labels. A visualization would be far more useful. In the next exercise, you'll inspect your clustering with a scatter plot!
Let's now inspect the clustering you performed in the previous exercise!
A solution to the previous exercise has already run, so new_points
is an array of points and labels
is the array of their cluster labels.
Instructions
- Import matplotlib.pyplot as plt.
- Assign column 0 of new_points to xs, and column 1 of new_points to ys.
- Make a scatter plot of xs and ys, specifying the c=labels keyword argument to color the points by their cluster label. Also specify alpha=0.5.
- Compute the coordinates of the cluster centers using the .cluster_centers_ attribute of model.
- Assign column 0 of centroids to centroids_x, and column 1 of centroids to centroids_y.
- Make a scatter plot of centroids_x and centroids_y, using 'D' (a diamond) as a marker by specifying the marker parameter. Set the size of the markers to be 50 using s=50.
# Import pyplot
# import matplotlib.pyplot as plt
new_points = new_points.to_numpy()
# Assign the columns of new_points: xs and ys
xs = new_points[:, 0]
ys = new_points[:, 1]
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=new_labels, alpha=0.5)
# Assign the cluster centers: centroids
centroids = model.cluster_centers_
# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()
The clustering looks great! But how can you be sure that 3 clusters is the correct choice? In other words, how can you evaluate the quality of a clustering? Tune into the next video in which Ben will explain how to evaluate a clustering!
Evaluating a clustering
Iris: clusters vs species
Cross tabulation with pandas
Aligning labels and species
Crosstab of labels and species
Measuring clustering quality
Inertia measures clustering quality
Inertia is computed when the .fit() method is called, and is available afterwards as the .inertia_ attribute.
The number of clusters
How many clusters to choose?
ct = pd.crosstab(pred_labels.pred_labels, pred_labels.species)
ct
iris_model.inertia_
Sum_of_squared_distances = list()
K = range(1, 10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X_iris)
    Sum_of_squared_distances.append(km.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.grid()
plt.show()
In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array
samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?
KMeans
and PyPlot (plt
) have already been imported for you.
This dataset was sourced from the UCI Machine Learning Repository.
Instructions
- For each of the given values of k, perform the following steps:
  - Create a KMeans instance called model with k clusters.
  - Fit the model to the grain data samples.
  - Append the value of the inertia_ attribute of model to the list inertias.
- The code to plot ks vs inertias has been written for you, so hit 'Submit Answer' to see the plot!
samples = sed.iloc[:, :-2]
ks = range(1, 6)
inertias = list()
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(samples)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.
You have the array samples of grain samples
, and a list varieties
giving the grain variety for each sample. Pandas (pd
) and KMeans
have already been imported for you.
Instructions
- Create a KMeans model called model with 3 clusters.
- Use the .fit_predict() method of model to fit it to samples and derive the cluster labels. Using .fit_predict() is the same as using .fit() followed by .predict().
- Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
- Use the pd.crosstab() function on df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. Assign the result to ct.
varieties = sed.varieties
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df.labels, df.varieties)
# Display ct
ct
The cross-tabulation shows that the 3 varieties of grain separate really well into 3 clusters. But depending on the type of data you are working with, the clustering may not always be this good. Is there anything you can do in such situations to improve your clustering?
Piedmont wines dataset
Clustering the wines
Clusters vs. varieties
Feature variances
StandardScaler
sklearn StandardScaler
Similar methods
StandardScaler, then KMeans
Pipelines combine multiple steps
Feature standardization improves clustering
sklearn preprocessing steps
win.head(2)
wine_samples = win.iloc[:, 2:]
wine_model = KMeans(n_clusters=3)
wine_labels = wine_model.fit_predict(wine_samples)
wine_pred = pd.DataFrame({'labels': wine_labels, 'varieties': win.class_name})
wine_ct = pd.crosstab(wine_pred.labels, wine_pred.varieties)
wine_ct
wine_samples.var().round(3)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))
sns.scatterplot(data=win, x='od280', y='malic_acid', hue='class_name', ax=ax1)
ax1.legend(title='Variety', bbox_to_anchor=(1.05, 1), loc='upper left')
ax1.set_xlim(0, 8)
ax1.set_title('Real Labels')
sns.scatterplot(data=win, x='od280', y='malic_acid', hue=wine_pred.labels, palette="tab10", ax=ax2)
ax2.set_xlim(0, 8)
ax2.set_title('Predicted Labels')
plt.tight_layout()
p1 = sns.scatterplot(data=win, x='od280', y='proline', hue='class_name')
p1.set_xlim(-7.5, 7.5)
p1.set_title('Unscaled Values');
wine_scaler = StandardScaler()
wine_scaler.fit(wine_samples)
StandardScaler(copy=True, with_mean=True, with_std=True)
wine_samples_scaled = wine_scaler.transform(wine_samples)
wine_samples_scaled = pd.DataFrame(wine_samples_scaled, columns=win.columns[2:])
wine_samples_scaled.head(2)
p2 = sns.scatterplot(data=wine_samples_scaled, x='od280', y='proline', hue=win.class_name)
p2.set_xlim(-7.5, 7.5)
p2.set_title('Scaled Values');
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(wine_samples_scaled)
wine_scaled_labels = pipeline.predict(wine_samples_scaled)
wine_pred_scaled = pd.DataFrame({'labels': wine_scaled_labels, 'varieties': win.class_name})
wine_scaled_ct = pd.crosstab(wine_pred_scaled.labels, wine_pred_scaled.varieties)
wine_scaled_ct
You are given an array samples
giving measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, you'll need to standardize these features first. In this exercise, you'll build a pipeline to standardize and cluster the data.
These fish measurement data were sourced from the Journal of Statistics Education.
Instructions
- Import make_pipeline from sklearn.pipeline.
- Import StandardScaler from sklearn.preprocessing.
- Import KMeans from sklearn.cluster.
- Create an instance of StandardScaler called scaler.
- Create an instance of KMeans with 4 clusters called kmeans.
- Create a pipeline called pipeline that chains scaler and kmeans. To do this, you just need to pass them in as arguments to make_pipeline().
# Perform the necessary imports
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.cluster import KMeans
# Create scaler: scaler
scaler = StandardScaler()
# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
Now that you've built the pipeline, you'll use it in the next exercise to cluster the fish by their measurements.
You'll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.
As before, samples
is the 2D array of fish measurements. Your pipeline is available as pipeline
, and the species of every fish sample is given by the list species
.
Instructions
- Import pandas as pd.
- Fit the pipeline to the fish measurements samples.
- Obtain the cluster labels for samples by using the .predict() method of pipeline.
- Using pd.DataFrame(), create a DataFrame df with two columns named 'labels' and 'species', using labels and species, respectively, for the column values.
- Using pd.crosstab(), create a cross-tabulation ct of df['labels'] and df['species'].
samples = fsh.iloc[:, 1:]
species = fsh[0]
# Fit the pipeline to samples
pipeline.fit(samples)
# Calculate the cluster labels: labels
labels = pipeline.predict(samples)
# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': species})
# Create crosstab: ct
ct = pd.crosstab(df.labels, df.species)
# Display ct
ct
In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). You are given a NumPy array movements
of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.
Some stocks are more expensive than others. To account for this, include a Normalizer
at the beginning of your pipeline. The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.
Note that Normalizer()
is different to StandardScaler()
, which you used in the previous exercise. While StandardScaler()
standardizes features (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, Normalizer()
rescales each sample
- here, each company's stock price - independently of the other.
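To make the row-versus-column distinction concrete, here is a minimal sketch on a small made-up array (my own toy example, not part of the exercise): StandardScaler() rescales each column, while Normalizer() rescales each row.
# Toy comparison: column-wise standardization vs row-wise normalization
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
# StandardScaler works per feature (column): each column ends up with mean 0 and unit variance
print(StandardScaler().fit_transform(X_toy))
# Normalizer works per sample (row): each row is rescaled to unit L2 norm,
# so only the relative pattern within a row is kept
print(Normalizer().fit_transform(X_toy))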
KMeans
and make_pipeline
have already been imported for you.
Instructions
- Import Normalizer from sklearn.preprocessing.
- Create an instance of Normalizer called normalizer.
- Create an instance of KMeans called kmeans with 10 clusters.
- Using make_pipeline(), create a pipeline called pipeline that chains normalizer and kmeans.
- Fit the pipeline to the movements array.
movements = stk.to_numpy()
companies = stk.index.to_list()
# Import Normalizer
# from sklearn.preprocessing import Normalizer
# Create a normalizer: normalizer
normalizer = Normalizer()
# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10, random_state=12)
# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)
# Fit pipeline to the daily price movements
pipeline.fit(movements)
Now that your pipeline has been set up, you can find out which stocks move together in the next exercise!
In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.
Your solution to the previous exercise has already been run. Recall that you constructed a Pipeline pipeline
containing a KMeans
model and fit it to the NumPy array movements
of daily stock movements. In addition, a list companies
of the company names is available.
Instructions
- Import pandas as pd.
- Use the .predict() method of the pipeline to predict the labels for movements.
- Align the cluster labels with the list companies by creating a DataFrame df with labels and companies as columns. This has been done for you.
- Use the .sort_values() method of df to sort the DataFrame by the 'labels' column, and print the result.
# Predict the cluster labels: labels
labels = pipeline.predict(movements)
# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})
# Display df sorted by cluster label
df = df.sort_values('labels')
df
Take a look at the clusters. Are you surprised by any of the results? In the next chapter, you'll learn about how to communicate results such as this through visualizations.
stk_t = stk.T.copy()
stk_t.index = pd.to_datetime(stk_t.index)
stk_t = stk_t.rolling(30).mean()
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(16, 16))
axes = axes.ravel()
for i, (g, d) in enumerate(df.groupby('labels')):
    cols = d.companies.tolist()
    sns.lineplot(data=stk_t[cols], ax=axes[i])
    axes[i].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[i].set_title(f'30-Day Rolling Mean: Group {g}')
    axes[i].set_ylim(-3, 3)
fig.autofmt_xdate(rotation=90, ha='center')
plt.tight_layout()
plt.show()
In this chapter, you'll learn about two unsupervised learning techniques for data visualization, hierarchical clustering and t-SNE. Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. t-SNE maps the data samples into 2d space so that the proximity of the samples to one another can be visualized.
Visualizations communicate insight
A hierarchy of groups
Eurovision scoring dataset
Hierarchical clustering of voting countries
Hierarchical clustering
The dendrogram of a hierarchical clustering
scipy.cluster.hierarchy.dendrogram
Dendrograms, step-by-step
Hierarchical clustering with SciPy
A Note Regarding the Data
- The Eurovision dataset, euv, is used for the lecture and some of the following exercises.
- The .shape of the Eurovision samples is (42, 26).
- 'From country' is the index, 'To country' is the columns, and 'Jury Points' is the values.
- In the samples produced by DataCamp, the order of the values has been changed for every row, so the data points do not correctly correspond to 'To country'.
- Having only seen samples from the IPython shell, there isn't an automated way, that I can see, to sort the rows to match the DataCamp example, so the dendrogram will not look the same.
euvp = euv.pivot(index='From country', columns='To country', values='Jury Points').fillna(0)
euv_samples = euvp.to_numpy()
euvp.iloc[:5, :5]
plt.figure(figsize=(16, 6))
euv_mergings = linkage(euv_samples, method='complete')
dendrogram(euv_mergings, labels=euvp.index, leaf_rotation=90, leaf_font_size=12)
plt.title('Countries Hierarchically Clustered by Eurovision 2016 Voting')
plt.show()
If there are 5 data samples, how many merge operations will occur in a hierarchical clustering?
(To help answer this question, think back to the video, in which Ben walked through an example of hierarchical clustering using 6 countries.)
Possible Answers
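To check the reasoning empirically, here is a small sketch using five made-up 2D points (not the course data): linkage() returns one row per merge operation.
# Hierarchical clustering of 5 toy points: linkage() returns one row per merge
import numpy as np
from scipy.cluster.hierarchy import linkage
five_points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])
toy_mergings = linkage(five_points, method='complete')
print(toy_mergings.shape)  # (4, 4): 5 samples are combined in 4 merge operations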
In the video, you learned that the SciPy linkage()
function performs hierarchical clustering on an array of samples. Use the linkage()
function to obtain a hierarchical clustering of the grain samples, and use dendrogram()
to visualize the result. A sample of the grain measurements is provided in the array samples
, while the variety of each grain sample is given by the list varieties
.
Instructions
- Import linkage and dendrogram from scipy.cluster.hierarchy.
- Import matplotlib.pyplot as plt.
- Perform hierarchical clustering on samples using the linkage() function with the method='complete' keyword argument. Assign the result to mergings.
- Plot a dendrogram using the dendrogram() function on mergings. Specify the keyword arguments labels=varieties, leaf_rotation=90, and leaf_font_size=6.
# the DataCamp sample uses a subset of the seed data; the linkage result is very dependent upon the random_state
seed_sample = sed.groupby('varieties').sample(n=14, random_state=250)
samples = seed_sample.iloc[:, :7]
varieties = seed_sample.varieties.tolist()
# Perform the necessary imports
# from scipy.cluster.hierarchy import linkage, dendrogram
# import matplotlib.pyplot as plt
# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')
# Plot the dendrogram, using varieties as labels
plt.figure(figsize=(15, 6))
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=10)
plt.show()
Dendrograms are a great way to illustrate the arrangement of the clusters produced by hierarchical clustering.
In chapter 1, you used k-means clustering to cluster companies according to their stock price movements. Now, you'll perform hierarchical clustering of the companies. You are given a NumPy array of price movements movements
, where the rows correspond to companies, and a list of the company names companies
. SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the normalize()
function from sklearn.preprocessing
instead of Normalizer
.
linkage
and dendrogram
have already been imported from scipy.cluster.hierarchy
, and PyPlot has been imported as plt
.
Instructions
- Import normalize from sklearn.preprocessing.
- Rescale the price movements by using the normalize() function on movements.
- Apply the linkage() function to normalized_movements, using 'complete' linkage, to calculate the hierarchical clustering. Assign the result to mergings.
- Plot a dendrogram of the hierarchical clustering, using the list companies of company names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you did in the previous exercise.
# Import normalize
# from sklearn.preprocessing import normalize
# Normalize the movements: normalized_movements
normalized_movements = normalize(stk)
# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')
# Plot the dendrogram
plt.figure(figsize=(15, 6))
dendrogram(mergings, labels=stk.index, leaf_rotation=90, leaf_font_size=10)
plt.show()
Cluster labels in hierarchical clustering
Intermediate clusterings & height on dendrogram
Dendrograms show cluster distances
Intermediate clusterings & height on dendrogram
Distance between clusters
Extracting cluster labels
Extracting cluster labels using fcluster
Aligning cluster labels with country names
mergings = linkage(euv_samples, method='complete')
labels = fcluster(mergings, 15, criterion='distance')
print(labels)
pairs = pd.DataFrame({'labels': labels, 'countries': euvp.index}).sort_values('labels')
pairs
In the video, you learned that the linkage method defines how the distance between clusters is measured. In complete linkage, the distance between clusters is the distance between the furthest points of the clusters. In single linkage, the distance between clusters is the distance between the closest points of the clusters.
Consider the three clusters in the diagram. Which of the following statements are true?
A. In single linkage, Cluster 3 is the closest to Cluster 2.
B. In complete linkage, Cluster 1 is the closest to Cluster 2.
Possible Answers
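As a quick illustration of how the choice of linkage changes the merge distances, here is a sketch on toy points of my own (not the clusters in the diagram), computing both linkages on the same data.
# Same points, different linkage: the merge heights (3rd column) differ because
# 'complete' measures the furthest points between clusters and 'single' the closest
import numpy as np
from scipy.cluster.hierarchy import linkage
pts = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [4.0, 0.0], [4.0, 1.5]])
print(linkage(pts, method='complete')[:, 2])  # merge distances with complete linkage
print(linkage(pts, method='single')[:, 2])    # merge distances with single linkage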
In the video, you saw a hierarchical clustering of the voting countries at the Eurovision song contest using 'complete'
linkage. Now, perform a hierarchical clustering of the voting countries with 'single'
linkage, and compare the resulting dendrogram with the one in the video. Different linkage, different hierarchical clustering!
You are given an array samples
. Each row corresponds to a voting country, and each column corresponds to a performance that was voted for. The list country_names
gives the name of each voting country. This dataset was obtained from Eurovision.
Instructions
- Import linkage and dendrogram from scipy.cluster.hierarchy.
- Perform hierarchical clustering on samples using the linkage() function with the method='single' keyword argument. Assign the result to mergings.
- Plot a dendrogram of the hierarchical clustering, using the list country_names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you have done earlier.
country_names = euv['From country'].unique()
# Perform the necessary imports
# import matplotlib.pyplot as plt
# from scipy.cluster.hierarchy import linkage, dendrogram
# Calculate the linkage: mergings
mergings = linkage(euv_samples, method='single')
# Plot the dendrogram
plt.figure(figsize=(16, 6))
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=12)
plt.show()
As you can see, performing single linkage hierarchical clustering produces a different dendrogram!
Displayed on the right is the dendrogram for the hierarchical clustering of the grain samples that you computed earlier. If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?
Possible Answers
In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters. Now, use the fcluster()
function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.
The hierarchical clustering has already been performed and mergings
is the result of the linkage()
function. The list varieties
gives the variety of each grain sample.
Instructions
- Import pandas as pd.
- Import fcluster from scipy.cluster.hierarchy.
- Perform a flat hierarchical clustering by using the fcluster() function on mergings. Specify a maximum height of 6 and the keyword argument criterion='distance'.
- Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
- Create a cross-tabulation ct between df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label.
# the DataCamp sample uses a subset of the seed data; the linkage result is very dependent upon the random_state
seed_sample = sed.groupby('varieties').sample(n=14, random_state=250)
samples = seed_sample.iloc[:, :7]
varieties = seed_sample.varieties.tolist()
# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')
# Perform the necessary imports
# import pandas as pd
# from scipy.cluster.hierarchy import fcluster
# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')
# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
# Create crosstab: ct
ct = pd.crosstab(df.labels, df.varieties)
# Display ct
ct
t-SNE for 2-dimensional maps
t-SNE on the iris dataset
Interpreting t-SNE scatter plots
t-SNE in sklearn
t-SNE has only fit_transform()
t-SNE learning rate
Different every time
rs = [100, 200, 300]
fig, axes = plt.subplots(ncols=3, figsize=(15, 3))
axes = axes.ravel()
for i, state in enumerate(rs):
    ax = axes[i]
    model = TSNE(learning_rate=100, random_state=state)
    transformed = model.fit_transform(iris.iloc[:, :4])
    xs = transformed[:, 0]
    ys = transformed[:, 1]
    sns.scatterplot(x=xs, y=ys, hue=iris.species, ax=ax)
    ax.set_title(f't-SNE applied to Iris with random_state={state}')
plt.tight_layout()
plt.show()
In the video, you saw t-SNE applied to the iris dataset. In this exercise, you'll apply t-SNE to the grain samples data and inspect the resulting t-SNE features using a scatter plot. You are given an array samples
of grain samples and a list variety_numbers
giving the variety number of each grain sample.
Instructions
- Import TSNE from sklearn.manifold.
- Create a TSNE instance called model with learning_rate=200.
- Apply the .fit_transform() method of model to samples. Assign the result to tsne_features.
- Select column 0 of tsne_features. Assign the result to xs.
- Select column 1 of tsne_features. Assign the result to ys.
- Make a scatter plot of the t-SNE features xs and ys. To color the points by the grain variety, specify the additional keyword argument c=variety_numbers.
samples = sed.iloc[:, :7]
variety_numbers = sed[7]
variety_names = sed.varieties
# Import TSNE
# from sklearn.manifold import TSNE
# Create a TSNE instance: model
model = TSNE(learning_rate=200, random_state=300)
# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)
# Select the 0th feature: xs
xs = tsne_features[:,0]
# Select the 1st feature: ys
ys = tsne_features[:,1]
# Scatter plot, coloring by variety_numbers
# plt.scatter(xs, ys, c=variety_numbers)
sns.scatterplot(x=xs, y=ys, hue=variety_names)
plt.show()
t-SNE provides great visualizations when the individual samples can be labeled. In this exercise, you'll apply t-SNE to the company stock price data. A scatter plot of the resulting t-SNE features, labeled by the company names, gives you a map of the stock market! The stock price movements for each company are available as the array normalized_movements
(these have already been normalized for you). The list companies
gives the name of each company. PyPlot (plt
) has been imported for you.
Instructions
- Import TSNE from sklearn.manifold.
- Create a TSNE instance called model with learning_rate=50.
- Apply the .fit_transform() method of model to normalized_movements. Assign the result to tsne_features.
- Select column 0 and column 1 of tsne_features.
- Make a scatter plot of the t-SNE features xs and ys. Specify the additional keyword argument alpha=0.5.
- Code to label each point with its company name has been written for you using plt.annotate(), so just hit 'Submit Answer' to see the visualization!
# Import TSNE
# from sklearn.manifold import TSNE
# Create a TSNE instance: model
model = TSNE(learning_rate=50, random_state=300)
# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)
# Select the 0th feature: xs
xs = tsne_features[:, 0]
# Select the 1th feature: ys
ys = tsne_features[:, 1]
# Scatter plot
plt.figure(figsize=(16, 10))
plt.scatter(xs, ys, alpha=0.5)
# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=10, alpha=0.75)
plt.show()
It's visualizations such as this that make t-SNE such a powerful tool for extracting quick insights from high dimensional data.
Dimension reduction summarizes a dataset using its commonly occurring patterns. In this chapter, you'll learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, you'll employ a variant of PCA that will allow you to cluster Wikipedia articles by their content!
Dimension reduction
Principal Component Analysis
PCA aligns data with axes
PCA follows the fit/transform pattern
In scikit-learn, PCA is available as PCA, and it has fit() and transform() methods just like StandardScaler.
Using scikit-learn PCA
from sklearn.decomposition import PCA
PCA features
PCA features are not correlated
Pearson correlation
Principal components
wine_samples = win[['total_phenols', 'od280']]
wine_samples.head(3)
wine_samples.corr().round(1)
wine_model = PCA()
wine_model.fit(wine_samples)
wine_transformed = wine_model.transform(wine_samples)
wine_transformed_df = pd.DataFrame(wine_transformed, columns=['total_phenols', 'od280'])
wine_transformed_df.head(3)
wine_transformed_df.corr().round(1)
wine_model.components_
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
sns.scatterplot(data=wine_samples, x='total_phenols', y='od280', hue=win.class_name, ax=ax1)
ax1.set_ylim(-4, 6)
ax1.set_xlim(-4, 6)
ax1.set_title('Not Scaled')
sns.scatterplot(data=wine_transformed_df, x='total_phenols', y='od280', hue=win.class_name, ax=ax2)
ax2.set_ylim(-4, 6)
ax2.set_xlim(-4, 6)
ax2.set_title('PCA Scaled')
plt.tight_layout()
plt.show()
You are given an array grains
giving the width and length of samples of grain. You suspect that width and length will be correlated. To confirm this, make a scatter plot of width vs length and measure their Pearson correlation.
Instructions
- Import matplotlib.pyplot as plt.
- Import pearsonr from scipy.stats.
- Assign column 0 of grains to width and column 1 of grains to length.
- Make a scatter plot with width on the x-axis and length on the y-axis.
- Use the pearsonr() function to calculate the Pearson correlation of width and length.
grains = sed[[4, 3]].to_numpy()
varieties = sed[7]
grains[:2, :]
# Perform the necessary imports
# import matplotlib.pyplot as plt
# from scipy.stats import pearsonr
# Assign the 0th column of grains: width
width = grains[:, 0]
# Assign the 1st column of grains: length
length = grains[:, 1]
# Scatter plot width vs length
plt.scatter(width, length, c=varieties)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation
correlation, pvalue = pearsonr(width, length)
# Display the correlation
print(correlation)
p = sns.scatterplot(data=sed, x=4, y=3, hue='varieties')
p.set_xlabel('width')
p.set_ylabel('length')
sed[[4, 3]].corr()
You observed in the previous exercise that the width and length measurements of the grain are correlated. Now, you'll use PCA to decorrelate these measurements, then plot the decorrelated points and measure their Pearson correlation.
Instructions
- Import PCA from sklearn.decomposition.
- Create an instance of PCA called model.
- Use the .fit_transform() method of model to apply the PCA transformation to grains. Assign the result to pca_features.
- The remaining code, which plots and measures the correlation of pca_features, has been written for you, so hit 'Submit Answer' to see the result!
# Import PCA
# from sklearn.decomposition import PCA
# Create PCA instance: model
model = PCA()
# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)
# Assign 0th column of pca_features: xs
xs = pca_features[:,0]
# Assign 1st column of pca_features: ys
ys = pca_features[:,1]
# Scatter plot xs vs ys
plt.scatter(xs, ys, c=varieties)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)
# Display the correlation
print(f'Correlation: {round(correlation)}')
There are three scatter plots of the same point cloud. Each scatter plot shows a different set of axes (in red). In which of the plots could the axes represent the principal components of the point cloud?
Recall that the principal components are the directions along which the data varies.
Possible Answers
Intrinsic dimension of a flight path
Intrinsic dimension
Versicolor dataset
Versicolor dataset has intrinsic dimension 2
PCA identifies intrinsic dimension
PCA of the versicolor samples
PCA features are ordered by variance descending
Variance and intrinsic dimension
Plotting the variances of PCA features
Intrinsic dimension can be ambiguous
iris = sns.load_dataset('iris')
iris.head()
y = iris.species.astype('category').cat.codes
vers = iris[iris.species.eq('versicolor')]
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=15, azim=40)
ax.scatter(iris.sepal_length, iris.sepal_width, iris.petal_width, c=y, edgecolor='k', s=40)
ax.set_title("Iris")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_zlabel("Petal Width")
plt.show()
pca = PCA()
iris_reduced = pca.fit_transform(iris[['sepal_length', 'sepal_width', 'petal_width']])
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=25, azim=55)
ax.scatter(iris_reduced[:, 0], iris_reduced[:, 1], iris_reduced[:, 2], c=y, edgecolor='k', s=40)
ax.set_title("Iris Reduced")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_zlabel("Petal Width")
plt.show()
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=25, azim=-235)
ax.scatter(vers.sepal_length, vers.sepal_width, vers.petal_width, edgecolor='k', s=40)
ax.set_title("Versicolor")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_zlabel("Petal Width")
ax.set_xlim(4.5, 7.5)
ax.set_ylim(1.5, 4.0)
ax.set_zlim(0, 2.5)
plt.show()
pca = PCA()
pca.fit(vers[['sepal_length', 'sepal_width', 'petal_width']])
vers_reduced = pca.transform(vers[['sepal_length', 'sepal_width', 'petal_width']])
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=15, azim=-245)
ax.scatter(vers_reduced[:, 0], vers_reduced[:, 1], vers_reduced[:, 2], edgecolor='k', s=40)
ax.set_title("Versicolor Reduced")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
ax.set_zlabel("Petal Width")
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_zlim(-1.5, 1.5)
plt.show()
features = range(pca.n_components_)
features
pca.explained_variance_
versi_df = pd.DataFrame(vers_reduced, columns=['sepal_length', 'sepal_width', 'petal_width'])
versi_df.var().plot(kind='bar')
versi_df.var()
The first principal component of the data is the direction in which the data varies the most. In this exercise, your job is to use PCA to find the first principal component of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot.
The array grains
gives the length and width of the grain samples. PyPlot (plt
) and PCA
have already been imported for you.
Instructions
- Create a PCA instance called model.
- Fit the model to the grains data.
- Extract the coordinates of the mean of the data using the .mean_ attribute of model.
- Get the first principal component of model using the .components_[0,:] attribute.
- Plot the first principal component as an arrow on the scatter plot, using the plt.arrow() function. You have to specify the first two arguments - mean[0] and mean[1].
# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])
# Create a PCA instance: model
model = PCA()
# Fit model to points
model.fit(grains)
# Get the mean of the grain samples: mean
mean = model.mean_
# Get the first principal component: first_pc
first_pc = model.components_[0, :]
# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
# Keep axes on same scale
plt.axis('equal')
plt.show()
This is the direction in which the grain data varies the most.
The fish dataset is 6-dimensional. But what is its intrinsic dimension? Make a plot of the variances of the PCA features to find out. As before, samples
is a 2D array, where each row represents a fish. You'll need to standardize the features first.
Instructions
- Create an instance of StandardScaler called scaler.
- Create a PCA instance called pca.
- Use the make_pipeline() function to create a pipeline chaining scaler and pca.
- Use the .fit() method of pipeline to fit it to the fish samples samples.
- Extract the number of components used using the .n_components_ attribute of pca. Place this inside a range() function and store the result as features.
- Use the plt.bar() function to plot the explained variances, with features on the x-axis and pca.explained_variance_ on the y-axis.
samples = fsh.iloc[:, 1:].to_numpy()
samples[:3, :]
# Perform the necessary imports
# from sklearn.decomposition import PCA
# from sklearn.preprocessing import StandardScaler
# from sklearn.pipeline import make_pipeline
# import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
It looks like PCA features 0 and 1 have significant variance.
In the previous exercise, you plotted the variance of the PCA features of the fish measurements. Looking again at your plot, what do you think would be a reasonable choice for the "intrinsic dimension" of the fish measurements? Recall that the intrinsic dimension is the number of PCA features with significant variance.
Possible Answers
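One way to make this judgement less visual is to look at the proportion of variance each PCA feature explains. Here is a small sketch reusing the pca instance fitted inside the pipeline above; the 95% threshold is only an illustrative heuristic, not the course's criterion.
# Cumulative share of variance explained by the PCA features of the scaled fish data
import numpy as np
ratios = pca.explained_variance_ratio_
print(ratios.round(3))
print(np.cumsum(ratios).round(3))
# number of leading features needed to reach 95% of the variance (illustrative threshold)
print(int(np.argmax(np.cumsum(ratios) >= 0.95) + 1))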
Dimension reduction
Dimension reduction with PCA
Dimension reduction of iris dataset
Iris dataset in 2 dimensions
Dimension reduction with PCA
Word frequency arrays
Sparse arrays and csr_matrix
Word-frequency arrays are stored as a scipy.sparse.csr_matrix instead of a NumPy array. csr_matrices save space by remembering only the non-zero entries of the array.
TruncatedSVD and csr_matrix
scikit-learn's PCA doesn't support csr_matrices, and you'll need to use TruncatedSVD instead. TruncatedSVD performs the same transformation as PCA, but accepts csr matrices as input.
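A minimal sketch of the idea on a tiny made-up word-frequency array (not the Wikipedia data): a csr_matrix stores only the non-zero entries, and TruncatedSVD accepts it directly where PCA would not.
# Sparse storage plus TruncatedSVD on a tiny toy word-frequency array
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
dense = np.array([[3.0, 0.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 4.0, 1.0]])
sparse = csr_matrix(dense)
print(sparse.nnz)  # only the 5 non-zero entries are stored
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(sparse)  # works on a csr_matrix, where PCA would raise an error
print(reduced.shape)  # (3, 2)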
Dimension Reduction of the Iris Dataset
iris.iloc[:, :4].shape
pca = PCA(n_components=2)
pca.fit(iris.iloc[:, :4])
transformed = pca.transform(iris.iloc[:, :4])
transformed.shape
xs = transformed[:,0]
ys = transformed[:,1]
sns.scatterplot(x=xs, y=ys, hue=iris.species)
plt.show()
TruncatedSVD and csr_matrix
wik1.shape
wik1.iloc[:3, :6]
model = TruncatedSVD(n_components=3)
model.fit(wik1)  # in the lecture, documents is a csr_matrix; here wik1 is a DataFrame
TruncatedSVD(algorithm='randomized')
transformed = model.transform(wik1)
transformed.shape
transformed[:3, :]
In a previous exercise, you saw that 2
was a reasonable choice for the "intrinsic dimension" of the fish measurements. Now use PCA for dimensionality reduction of the fish measurements, retaining only the 2 most important components.
The fish measurements have already been scaled for you, and are available as scaled_samples
.
Instructions
- Import PCA from sklearn.decomposition.
- Create a PCA instance called pca with n_components=2.
- Use the .fit() method of pca to fit it to the scaled fish measurements scaled_samples.
- Use the .transform() method of pca to transform the scaled_samples. Assign the result to pca_features.
fsh.info()
scaler = StandardScaler()
scaler.fit(fsh.iloc[:, 1:])
scaled_samples = scaler.transform(fsh.iloc[:, 1:])
scaled_samples.shape
scaled_samples[:3, :]
# Import PCA
# from sklearn.decomposition import PCA
# Create a PCA model with 2 components: pca
pca = PCA(n_components=2)
# Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)
# Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)
# Print the shape of pca_features
pca_features.shape
In this exercise, you'll create a tf-idf word frequency array for a toy collection of documents. For this, use the TfidfVectorizer
from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has fit()
and transform()
methods like other sklearn objects.
You are given a list documents of toy documents about pets. Its contents have been printed in the IPython Shell.
Instructions
- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer instance called tfidf.
- Apply the .fit_transform() method of tfidf to documents and assign the result to csr_mat. This is a word-frequency array in csr_matrix format.
- Inspect csr_mat by calling its .toarray() method and printing the result. This has been done for you.
- Get the words by calling the .get_feature_names() method of tfidf, and assign the result to words.
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
# Import TfidfVectorizer
# from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)
# Print result of toarray() method
print(csr_mat.toarray())
# Get the words: words
words = tfidf.get_feature_names()
# Print words
print(words)
You saw in the video that TruncatedSVD
is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.
Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).
The Wikipedia dataset you will be working with was obtained from here.
Instructions
- Import TruncatedSVD from sklearn.decomposition.
- Import KMeans from sklearn.cluster.
- Import make_pipeline from sklearn.pipeline.
- Create a TruncatedSVD instance called svd with n_components=50.
- Create a KMeans instance called kmeans with n_clusters=6.
- Create a pipeline called pipeline consisting of svd and kmeans.
# Perform the necessary imports
# from sklearn.decomposition import TruncatedSVD
# from sklearn.cluster import KMeans
# from sklearn.pipeline import make_pipeline
# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)
# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)
# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
It is now time to put your pipeline from the previous exercise to work! You are given an array articles
of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles
of their titles. Use your pipeline to cluster the Wikipedia articles.
A solution to the previous exercise has been pre-loaded for you, so a Pipeline pipeline
chaining TruncatedSVD with KMeans is available.
Instructions
- Import pandas as pd.
- Fit the pipeline to the word-frequency array articles.
- Predict the cluster labels for articles.
- Align the cluster labels with the list titles of article titles by creating a DataFrame df with labels and titles as columns. This has been done for you.
- Use the .sort_values() method of df to sort the DataFrame by the 'label' column, and print the result.
wik1.shape
wik1.iloc[:5, :5]
articles = csr_matrix(wik1)
articles.shape
titles = wik1.index
print(titles)
# Import pandas
# import pandas as pd
# Fit the pipeline to articles
pipeline.fit(wik1)
# Calculate the cluster labels: labels
labels = pipeline.predict(wik1)
# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})
# Display df sorted by cluster label
df.sort_values(['label', 'article'])
In this chapter, you'll learn about a dimension reduction technique called "Non-negative matrix factorization" ("NMF") that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images in terms of commonly occurring visual patterns. You'll also learn to use NMF to build recommender systems that can find you similar articles to read, or musical artists that match your listening history!
Interpretable parts
Using scikit-learn NMF
NMF follows the same fit/transform pattern as PCA.
Example word-frequency array
Example usage of NMF
NMF components
NMF features
Reconstruction of a sample
Sample reconstruction
NMF fits to non-negative data only
Here the wik1 dataset is used. wik1 has 13125 columns, while the toy example had 4.
model = NMF(n_components=6, init=None)
model.fit(wik1_sparse)
nmf_features = model.transform(wik1_sparse)
model.components_
# just the first 6 features
nmf_features[:6]
sample_row = wik1.loc['Climate change', :].to_numpy()
nmf_features[14, :].reshape((6, 1))
reconstruction = np.sum(nmf_features[14, :].reshape((6, 1)) * model.components_, axis=0)
reconstruction
df_exp = pd.DataFrame({'original value': sample_row, 'reconstructed value': reconstruction})
df_exp[df_exp['original value'].gt(0.15)]
wik2 contains the column names of wik1 (i.e. the feature terms).
wik2.iloc[[1865, 2078, 5216, 5818, 11866], :]
Which of the following 2-dimensional arrays are examples of non-negative data?
1. A tf-idf word-frequency array.
2. An array of daily stock market price movements (up and down), where each row represents a company.
3. An array where rows are customers, columns are products, and entries are 0 or 1, indicating whether a customer has purchased a product.
Possible Answers
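Before handing an array to NMF it is easy to verify that it qualifies; a small sketch on made-up arrays (not the quiz data):
# NMF requires non-negative input, which can be checked up front
import numpy as np
tfidf_like = np.array([[0.0, 0.5, 0.2], [0.3, 0.0, 0.0]])         # word frequencies are never negative
movements_like = np.array([[0.4, -1.2, 0.7], [-0.1, 0.3, -0.5]])  # price movements can be negative
print((tfidf_like >= 0).all())      # True  -> suitable for NMF
print((movements_like >= 0).all())  # False -> not suitable for NMF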
In the video, you saw NMF applied to transform a toy word-frequency array. Now it's your turn to apply NMF, this time using the tf-idf word-frequency array of Wikipedia articles, given as a csr matrix articles
. Here, fit the model and transform the articles. In the next exercise, you'll explore the result.
Instructions
- Import NMF from sklearn.decomposition.
- Create an NMF instance called model with 6 components.
- Fit the model to the word-frequency array articles.
- Use the .transform() method of model to transform articles, and assign the result to nmf_features.
- Print nmf_features to get a first idea what it looks like (.round(2) rounds the entries to 2 decimal places).
articles = wik1_sparse
# Import NMF
# from sklearn.decomposition import NMF
# Create an NMF instance: model
model = NMF(n_components=6, init=None)
# Fit the model to articles
model.fit(articles)
# Transform the articles: nmf_features
nmf_features = model.transform(articles)
# Print the NMF features
print(nmf_features.round(2))
Now you will explore the NMF features you created in the previous exercise. A solution to the previous exercise has been pre-loaded, so the array nmf_features
is available. Also available is a list titles
giving the title of each Wikipedia article.
When investigating the features, notice that for both actors, the NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component. In the next video, you'll see why: NMF components represent topics (for instance, acting!).
Instructions
- Import pandas as pd.
- Create a DataFrame df from nmf_features using pd.DataFrame(). Set the index to titles using index=titles.
- Use the .loc[] accessor of df to select the row with title 'Anne Hathaway', and print the result. These are the NMF features for the article about the actress Anne Hathaway.
- Repeat the last step for 'Denzel Washington' (another actor).
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=wik1.index)
display(df.head())
# Print the row for 'Anne Hathaway'
display(df.loc['Anne Hathaway'].to_frame())
# Print the row for 'Denzel Washington'
display(df.loc['Denzel Washington'].to_frame())
Notice that for both actors, the NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component. In the next video, you'll see why: NMF components represent topics (for instance, acting!).
In this exercise, you'll check your understanding of how NMF reconstructs samples from its components using the NMF feature values. On the right are the components of an NMF model. If the NMF feature values of a sample are [2, 1]
, then which of the following is most likely to represent the original sample? A pen and paper will help here! You have to apply the same technique Ben used in the video to reconstruct the sample [0.1203 0.1764 0.3195 0.141]
.
Possible Answers
[2.2, 1.1, 2.1]
[0.5, 1.6, 3.1]
[-4.0, 1.0, -2.0]
mc = np.array([[1., 0.5, 0. ], [0.2, 0.1, 2.1]])
f = np.array([[2], [1]])
np.sum(f * mc, axis=0)
Example: NMF learns interpretable parts
Applying NMF to the articles
NMF components are topics
NMF components
Grayscale images
Grayscale image example
Grayscale images as flat arrays
Encoding a collection of images
Visualizing samples
sample = np.array([0, 1, 0.5, 1, 0, 1])
bitmap = sample.reshape((2, 3))
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()
In the video, you learned that when NMF is applied to documents, the components correspond to topics of documents, and the NMF features reconstruct the documents from the topics. Verify this for yourself for the NMF model that you built earlier using the Wikipedia articles. Previously, you saw that the 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.
The NMF model you built earlier is available as model
, while words
is a list of the words that label the columns of the word-frequency array.
After you are done, take a moment to recognise the topic that the articles about Anne Hathaway and Denzel Washington have in common!
Instructions
- Import pandas as pd.
- Create a DataFrame components_df from model.components_, setting columns=words so that columns are labeled by the words.
- Print components_df.shape to check the dimensions of the DataFrame.
- Use the .iloc[] accessor on the DataFrame components_df to select row 3. Assign the result to component.
- Call the .nlargest() method of component, and print the result. This gives the five words with the highest values for that component.
words = wik2[0].tolist()
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=words)
display(components_df.iloc[:5, :5])
# Print the shape of the DataFrame
print(components_df.shape)
# Select row 3: component
component = components_df.iloc[3, :]
# Print result of nlargest
component.nlargest()
Take a moment to recognise the topics that the articles about Anne Hathaway and Denzel Washington have in common!
In the following exercises, you'll use NMF to decompose grayscale images into their commonly occurring patterns. First, explore the image dataset and see how it is encoded as an array. You are given 100 images as a 2D array samples, where each row represents a single 13x8 image. The images in your dataset are pictures of an LED digital display.
Instructions
- Import matplotlib.pyplot as plt.
- Select row 0 of samples and assign the result to digit. For example, to select column 2 of an array a, you could use a[:,2]. Remember that since samples is a NumPy array, you can't use the .loc[] or .iloc[] accessors to select specific rows or columns.
- Print digit. This has been done for you. Notice that it is a 1D array of 0s and 1s.
- Use the .reshape() method of digit to get a 2D array with shape (13, 8). Assign the result to bitmap.
- Print bitmap, and notice that the 1s show the digit 7!
- Use the plt.imshow() function to display bitmap as an image.
samples = lcd.to_numpy()
# Select the 0th row: digit
digit = samples[0]
# Print digit
print(digit)
# Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape((13, 8))
# Print bitmap
print(bitmap)
# Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
You'll explore this dataset further in the next exercise and see for yourself how NMF can learn the parts of images.
Now use what you've learned about NMF to decompose the digits dataset. You are again given the digit images as a 2D array samples
. This time, you are also provided with a function show_as_image()
that displays the image encoded by any 1D array:
def show_as_image(sample):
bitmap = sample.reshape((13, 8))
plt.figure()
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
After you are done, take a moment to look through the plots and notice how NMF has expressed the digit as a sum of the components!
Instructions
- Import NMF from sklearn.decomposition.
- Create an NMF instance called model with 7 components (7 is the number of cells in an LED display).
- Apply the .fit_transform() method of model to samples. Assign the result to features.
- To each component of the model (accessed via model.components_), apply the show_as_image() function inside the loop.
- Assign row 0 of features to digit_features.
- Print digit_features.
def show_as_image(sample):
bitmap = sample.reshape((13, 8))
plt.figure()
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
# Import NMF
# from sklearn.decomposition import NMF
# Create an NMF model: model
model = NMF(n_components=7, init=None)
# Apply fit_transform to samples: features
features = model.fit_transform(samples)
# Call show_as_image on each component
for component in model.components_:
show_as_image(component)
# Assign the 0th row of features: digit_features
digit_features = features[0, :]
# Print digit_features
print(digit_features)
Take a moment to look through the plots and notice how NMF has expressed the digit as a sum of the components!
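To see the "sum of parts" idea concretely, here is a minimal sketch (assuming model, digit_features, and show_as_image() from the code above are still defined) that rebuilds the first digit from the learned components:
# Weighted sum of the 7 NMF components, using the feature values of the first digit
reconstruction = digit_features @ model.components_
show_as_image(reconstruction)  # should closely resemble the original digit 7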
Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (in the case of documents) or, when trained on images, to parts of the images. Verify this for yourself by inspecting the components of a PCA model fit to the dataset of LED digit images from the previous exercise. The images are available as a 2D array samples. Also available is a modified version of the show_as_image() function which colors a pixel red if the value is negative.
After submitting the answer, notice that the components of PCA do not represent meaningful parts of images of LED digits!
Instructions
- Import PCA from sklearn.decomposition.
- Create a PCA instance called model with 7 components.
- Apply the .fit_transform() method of model to samples. Assign the result to features.
- To each component of the model (accessed via model.components_), apply the show_as_image() function inside the loop.
# Import PCA
# from sklearn.decomposition import PCA
# Create a PCA instance: model
model = PCA(n_components=7)
# Apply fit_transform to samples: features
features = model.fit_transform(samples)
# Call show_as_image on each component
for component in model.components_:
show_as_image(component)
Notice that the components of PCA do not represent meaningful parts of images of LED digits!
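One quick way to see the difference (an optional check, assuming the fitted PCA model from the code above): PCA components contain negative values, which is exactly what the modified show_as_image() colors red, whereas NMF components are non-negative by construction.
# PCA components can be negative; NMF components never are
print((model.components_ < 0).any())  # expected to print True for the PCA model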
Finding similar articles
Strategy
Apply NMF to the word-frequency array
Strategy
Versions of articles
Cosine similarity
Calculating the cosine similarities
DataFrames and labels
DataFrames and labels
The wik1 dataset is used in the examples below.
# Fit NMF to the word-frequency array and get the features
nmf = NMF(n_components=6, init=None)
nmf_features = nmf.fit_transform(wik1_sparse)
# Normalize the features so that dot products give cosine similarities
norm_features = normalize(nmf_features)
# Cosine similarity of every article with article 45
current_article = norm_features[45, :]
similarities = norm_features.dot(current_article)
print(similarities)
# The same calculation with labeled rows, using a DataFrame
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=wik1.index)
current_article = df.loc['Hepatitis C']
similarities = df.dot(current_article)
similarities.nlargest(10)
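Because the rows have been L2-normalized, these dot products are cosine similarities. As an optional sanity check (not part of the course code), sklearn's cosine_similarity should agree with the normalized dot products:
# Cosine similarities of the raw features vs. dot products of the normalized rows
from sklearn.metrics.pairwise import cosine_similarity
print(np.allclose(cosine_similarity(nmf_features)[45], norm_features.dot(norm_features[45, :])))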
In the video, you learned how to use NMF features and the cosine similarity to find similar articles. Apply this to your NMF model for popular Wikipedia articles, by finding the articles most similar to the article about the footballer Cristiano Ronaldo. The NMF features you obtained earlier are available as nmf_features
, while titles
is a list of the article titles.
Instructions
- Import normalize from sklearn.preprocessing.
- Apply the normalize() function to nmf_features. Store the result as norm_features.
- Create a DataFrame df from norm_features, using titles as an index.
- Use the .loc[] accessor of df to select the row of 'Cristiano Ronaldo'. Assign the result to article.
- Apply the .dot() method of df to article to calculate the cosine similarity of every row with article.
- Apply the .nlargest() method of similarities to display the most similar articles. This has been done for you, so hit 'Submit Answer' to see the result!
# Perform the necessary imports
# import pandas as pd
# from sklearn.preprocessing import normalize
# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)
# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=wik1.index)
# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo']
# Compute the dot products: similarities
similarities = df.dot(article)
# Display those with the largest cosine similarity
print(similarities.nlargest())
You may need to know a little about football (or soccer, depending on where you're from!) to be able to evaluate for yourself the quality of the computed similarities!
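One optional tweak (not part of the exercise): the top match is always the query article itself, with a cosine similarity of 1, so you may want to drop it before ranking recommendations.
# Exclude 'Cristiano Ronaldo' itself from the recommendations
print(similarities.drop('Cristiano Ronaldo').nlargest())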
In this exercise and the next, you'll use what you've learned about NMF to recommend popular music artists! You are given a sparse array artists
whose rows correspond to artists and whose columns correspond to users. The entries give the number of times each artist was listened to by each user.
In this exercise, build a pipeline and transform the array into normalized NMF features. The first step in the pipeline, MaxAbsScaler
, transforms the data so that all users have the same influence on the model, regardless of how many different artists they've listened to. In the next exercise, you'll use the resulting normalized NMF features for recommendation!
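To make the scaling step concrete, here is a tiny toy illustration (made-up numbers, not the course data): MaxAbsScaler divides each column by its maximum absolute value, so every user's listen counts end up on the same 0-to-1 scale regardless of how much they listen overall.
# Hypothetical listen counts: rows are artists, columns are users
toy = np.array([[10.0, 0.0, 2.0],
                [5.0, 100.0, 0.0]])
print(MaxAbsScaler().fit_transform(toy))  # columns scale to [[1., 0., 1.], [0.5, 1., 0.]]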
Instructions
- Import NMF from sklearn.decomposition.
- Import Normalizer and MaxAbsScaler from sklearn.preprocessing.
- Import make_pipeline from sklearn.pipeline.
- Create an instance of MaxAbsScaler called scaler.
- Create an NMF instance with 20 components called nmf.
- Create an instance of Normalizer called normalizer.
- Create a pipeline that chains together scaler, nmf, and normalizer.
- Apply the .fit_transform() method of pipeline to artists. Assign the result to norm_features.
# Perform the necessary imports
# from sklearn.decomposition import NMF
# from sklearn.preprocessing import Normalizer, MaxAbsScaler
# from sklearn.pipeline import make_pipeline
# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()
# Create an NMF model: nmf
nmf = NMF(n_components=20, init=None)
# Create a Normalizer: normalizer
normalizer = Normalizer()
# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)
# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists_sparse)
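A quick optional check (assuming norm_features from the pipeline above): the final Normalizer step gives each row unit length, which is what makes the dot products in the next exercise cosine similarities.
# Each row of norm_features should have (approximately) unit L2 norm
print(np.linalg.norm(norm_features, axis=1)[:5])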
Suppose you were a big fan of Bruce Springsteen - which other musical artists might you like? Use your NMF features from the previous exercise and the cosine similarity to find similar musical artists. A solution to the previous exercise has been run, so norm_features is an array containing the normalized NMF features as rows. The names of the musical artists are available as the list artist_names.
Instructions
- Import pandas as pd.
- Create a DataFrame df from norm_features, using artist_names as an index.
- Use the .loc[] accessor of df to select the row of 'Bruce Springsteen'. Assign the result to artist.
- Apply the .dot() method of df to artist to calculate the dot product of every row with artist. Save the result as similarities.
- Apply the .nlargest() method of similarities to display the artists most similar to 'Bruce Springsteen'.
# Import pandas
# import pandas as pd
# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)
# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']
# Compute cosine similarities: similarities
similarities = df.dot(artist)
# Display those with highest cosine similarity
similarities.nlargest()
You've learned all about unsupervised learning, applied the techniques to real-world datasets, and built your knowledge of Python along the way. In particular, you've become a whiz at using scikit-learn and scipy for unsupervised learning challenges. You have harnessed both clustering and dimension reduction techniques to tackle serious problems with real-world datasets, such as clustering Wikipedia documents by the words they contain, and recommending musical artists to consumers.