Unsupervised Learning:
Finding Hidden Patterns in Chaos
📖 Contents
- 1. The Unsupervised Paradigm
- 2. Clustering Foundations
- 2.1 K‑Means & Inertia
- 2.2 DBSCAN
- 2.3 OPTICS
- 2.4 Hierarchical Clustering
- 2.5 Gaussian Mixture Models
- 2.6 Spectral Clustering
- 2.7 Affinity Propagation, BIRCH, Mean Shift
- 3. Dimensionality Reduction
- 3.1 Principal Component Analysis
- 3.2 Kernel PCA
- 3.3 t‑SNE & UMAP
- 3.4 Autoencoders
- 3.5 Self‑Supervised / Contrastive Learning
- 4. Anomaly Detection
- 4.1 Isolation Forest
- 4.2 One‑Class SVM
- 4.3 Local Outlier Factor (LOF)
- 4.4 Elliptic Envelope
- 4.5 Autoencoder‑based Anomaly Detection
- 5. Evaluation Without Labels
- 5.1 External validation (if labels exist)
- 6. Real‑World Case Studies
- 6.1 Customer Segmentation (wholesale)
- 6.2 MNIST Image Clustering
- 6.3 Credit Card Fraud Detection
- 7. Challenges & Advanced Topics
- 7.1 High‑Dimensional Data
- 7.2 Non‑Vectorial Data (Graphs, Sequences)
- 7.3 Deep Clustering
- 8. Production & Scalability
- 9. Fairness, Ethics & Interpretability
- 10. Conclusion & Further Reading
1. The Unsupervised Paradigm
In supervised learning, every example comes with a target label; in unsupervised learning we observe only the inputs and must uncover structure (clusters, manifolds, densities, anomalies) on our own. Key applications include:
- Scientific discovery: Clustering genes with similar expression patterns reveals unknown biological pathways.
- Customer analytics: Segmentation enables personalised marketing without prior labels.
- Anomaly detection: Identify fraudulent transactions, defective parts, or network intrusions.
- Feature learning: Autoencoders and self‑supervised methods produce rich representations for downstream tasks.
- Data compression & visualisation: PCA, t‑SNE, UMAP allow us to see high‑dimensional data in 2D/3D.
- Generative modelling: Variational autoencoders and GANs learn the underlying data distribution.
- Recommender systems: Matrix factorisation (SVD, NMF) uncovers latent user/item factors.
The unsupervised learning pipeline typically involves preprocessing (scaling, handling missing values), choosing an algorithm with appropriate hyperparameters, running the algorithm, and then interpreting or evaluating the results. Since no ground truth exists, domain knowledge and internal validation metrics are crucial.
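A minimal sketch of that pipeline with scikit-learn's Pipeline (the synthetic data and choice of k are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data standing in for a real, unlabelled dataset
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Preprocessing and clustering chained into one estimator
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('cluster', KMeans(n_clusters=3, random_state=42, n_init=10)),
])
labels = pipe.fit_predict(X)
print("Cluster sizes:", np.bincount(labels))
```

Chaining the scaler and the clusterer keeps preprocessing and model fitting in one object, which matters later when the same transformation must be replayed at inference time.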
2. Clustering: Organising Chaos into Groups
Clustering algorithms partition data into groups (clusters) such that points within a cluster are more similar to each other than to points in other groups. We cover four fundamental families: centroid‑based (k‑means), density‑based (DBSCAN, OPTICS), hierarchical, and probabilistic (Gaussian Mixture Models). Then we extend to spectral, affinity propagation, BIRCH, and mean shift.
2.1 K‑Means: The Workhorse of Partitioning
K‑Means aims to partition n observations into k clusters that minimise the within‑cluster sum of squared distances (the inertia): J = Σ_j Σ_{x∈C_j} ||x − μ_j||², where μ_j is the centroid of cluster C_j.
The algorithm alternates between (1) assigning each point to the nearest centroid, and (2) updating each centroid to the mean of its assigned points. Convergence to a local optimum is guaranteed. It is sensitive to initialisation (mitigated by k‑means++) and assumes roughly spherical clusters of similar size. Scaling features is mandatory.
Mathematical insight: minimising within‑cluster variance is equivalent to maximising between‑cluster variance (Huygens' theorem). The algorithm can be viewed as hard‑assignment EM for a Gaussian mixture with identical spherical covariances.
# =================================================
# K‑Means from scratch + scikit‑learn comparison
# =================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate synthetic data: 5 clusters
X, y_true = make_blobs(n_samples=400, centers=5,
                       cluster_std=0.70, random_state=42)
# ---- custom implementation ----
def kmeans_custom(X, k, max_iters=100, tol=1e-4):
    # Random initial centroids (better: k-means++ in production)
    np.random.seed(42)
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx]
    for _ in range(max_iters):
        # distance of every point to every centroid
        dists = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(dists, axis=1)
        # new centroid = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
labels_custom, cents_custom = kmeans_custom(X, 5)
# ---- scikit‑learn ----
kmeans_sk = KMeans(n_clusters=5, random_state=42, n_init=10)
labels_sk = kmeans_sk.fit_predict(X)
# ---- visual comparison ----
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X[:,0], X[:,1], c=labels_custom, cmap='viridis', edgecolor='k', alpha=0.7)
axes[0].scatter(cents_custom[:,0], cents_custom[:,1], c='red', marker='X', s=200, label='centroids')
axes[0].set_title('Custom K‑Means')
axes[1].scatter(X[:,0], X[:,1], c=labels_sk, cmap='viridis', edgecolor='k', alpha=0.7)
axes[1].scatter(kmeans_sk.cluster_centers_[:,0], kmeans_sk.cluster_centers_[:,1], c='red', marker='X', s=200)
axes[1].set_title('scikit‑learn K‑Means')
plt.suptitle('K‑Means clustering – identical results (apart from init)')
plt.tight_layout()
plt.savefig('kmeans_demo.png', dpi=100)
plt.close()
print("✅ K‑Means comparison executed. Both produce similar clusters.")
The inertia decreases monotonically as k grows (reaching zero when k = n), so it cannot select k on its own; the elbow method instead picks the k at which the marginal decrease in inertia flattens out.
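A minimal elbow-method sketch on the same kind of synthetic blobs (the range of candidate k is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.70, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks with k; the "elbow" is where the drop flattens
print("Inertias:", np.round(inertias, 1))
```

On clean blobs the drop from k=4 to k=5 is large and the curve flattens afterwards, which is the signal the heuristic looks for.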
2.2 DBSCAN: Density‑Based Clustering
DBSCAN (Density‑Based Spatial Clustering of Applications with Noise) defines clusters as contiguous regions of high density separated by low‑density areas. It has two parameters: eps, the neighbourhood radius, and minPts (min_samples in scikit‑learn), the minimum number of neighbours a point needs to count as dense.
Algorithm outline:
- For each point, find the points in its eps‑neighbourhood.
- If a point has at least minPts neighbours, it is a core point.
- Expand clusters by recursively adding density‑connected points.
- Points not reachable from any core point are labelled noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Non‑spherical data: two interleaving moons
X_moon, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
# DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels_db = db.fit_predict(X_moon)
# K‑Means on same data (fails)
kmeans_moon = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km_moon = kmeans_moon.fit_predict(X_moon)
# Plot
fig, ax = plt.subplots(1,2,figsize=(14,5))
ax[0].scatter(X_moon[:,0], X_moon[:,1], c=labels_db, cmap='cool', edgecolor='k')
ax[0].set_title('DBSCAN captures moon structure')
ax[1].scatter(X_moon[:,0], X_moon[:,1], c=labels_km_moon, cmap='cool', edgecolor='k')
ax[1].set_title('K‑Means forces convex boundaries')
plt.tight_layout()
plt.savefig('dbscan_vs_kmeans.png')
plt.close()
print("🔵 DBSCAN succeeds, k‑means fails on non‑spherical data.")
2.3 OPTICS: Extending DBSCAN for Varying Density
OPTICS (Ordering Points To Identify the Clustering Structure) generalises DBSCAN by removing the need for a single eps value: it orders points by reachability distance, exposing clustering structure across a range of density scales, and clusters are then extracted from the reachability plot (e.g. via the xi method).
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.1)
labels_optics = optics.fit_predict(X_moon)
plt.scatter(X_moon[:,0], X_moon[:,1], c=labels_optics, cmap='cool', edgecolor='k')
plt.title('OPTICS clustering (automatically finds hierarchy)')
plt.savefig('optics.png')
plt.close()
2.4 Hierarchical Clustering
Agglomerative clustering builds a hierarchy (dendrogram) by repeatedly merging the closest pair of clusters. Linkage criteria: single (minimum distance), complete (maximum), average, and Ward (minimises variance increase). Cutting the dendrogram at a height yields a flat partition.
Ward’s method: at each step, merge the pair of clusters that minimises the increase in total within‑cluster variance. It optimises the same sum‑of‑squares criterion as k‑means, but greedily and hierarchically.
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_classification
X_hier, _ = make_classification(n_samples=40, n_features=2, n_redundant=0,
n_clusters_per_class=1, n_classes=3, random_state=4)
# Ward linkage
Z = linkage(X_hier, method='ward')
# Plot dendrogram
plt.figure(figsize=(12,6))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Sample index (cluster size)')
plt.ylabel('Ward distance')
plt.tight_layout()
plt.savefig('dendrogram.png')
plt.close()
# Form 3 clusters by cutting at distance 7
labels_hier = fcluster(Z, t=7, criterion='distance')
print(f"Hierarchical cluster labels: {np.unique(labels_hier)}")
2.5 Gaussian Mixture Models (GMM) – Soft Clustering
GMM assumes data is generated from a mixture of K Gaussian components, each with its own mean and covariance; the EM algorithm fits the parameters and yields soft assignments, i.e. a posterior probability that each point belongs to each component.
The Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) can be used to select K: fit models over a range of component counts and keep the one with the lowest criterion value.
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# Fit GMM with 3 components
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
# Compare with true labels (for illustration only)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].scatter(X_pca[:,0], X_pca[:,1], c=y_iris, cmap='Set1', edgecolor='k')
axes[0].set_title('True Iris species')
axes[1].scatter(X_pca[:,0], X_pca[:,1], c=gmm_labels, cmap='Set1', edgecolor='k')
axes[1].set_title('GMM clustering (unsupervised)')
plt.savefig('gmm_iris.png')
plt.close()
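The BIC-based choice of the number of components mentioned above can be sketched on the same scaled Iris data (the candidate range is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

X = StandardScaler().fit_transform(load_iris().data)

bics = []
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))  # lower BIC is better

best_n = int(np.argmin(bics)) + 1
print("BIC per n_components:", np.round(bics, 1))
print("Best n by BIC:", best_n)
```

BIC penalises model complexity, so it tends to pick fewer components than the likelihood alone would suggest.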
2.6 Spectral Clustering
Spectral clustering uses the eigenvalues (spectrum) of a similarity matrix to reduce dimensionality before clustering in fewer dimensions. It is particularly good for non‑convex clusters. Steps: construct affinity matrix (e.g., k‑nearest neighbour graph), compute Laplacian, extract eigenvectors, and run k‑means on selected eigenvectors.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
# Concentric circles
X_circle, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=42)
labels_spec = sc.fit_predict(X_circle)
plt.scatter(X_circle[:,0], X_circle[:,1], c=labels_spec, cmap='cool', edgecolor='k')
plt.title('Spectral clustering on circles')
plt.savefig('spectral.png')
plt.close()
2.7 Affinity Propagation, BIRCH, Mean Shift
Affinity Propagation does not require specifying the number of clusters; it sends messages between pairs until exemplars emerge. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for large datasets by building a CF‑tree. Mean Shift shifts points towards the mode of density; it finds clusters of any shape without specifying k.
from sklearn.cluster import AffinityPropagation, Birch, MeanShift
# Example on small data
ap = AffinityPropagation(random_state=42).fit(X[:200])
birch = Birch(n_clusters=3).fit(X)
ms = MeanShift().fit(X[:200])
3. Dimensionality Reduction: Seeing the Manifold
High‑dimensional data suffer from the curse of dimensionality. Reducing dimensions to 2 or 3 aids visualisation and often improves generalisation. We cover PCA (linear), Kernel PCA, t‑SNE/UMAP, autoencoders, and modern self‑supervised methods.
3.1 Principal Component Analysis (PCA)
PCA finds orthogonal axes (principal components) that maximise variance. It is based on the eigen‑decomposition of the covariance matrix of the centred data: the eigenvectors give the component directions and the eigenvalues λ_i the variance captured along each.
The proportion of variance explained by the i‑th component is λ_i / Σ_j λ_j; the cumulative ratio guides how many components to retain.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Standardise (PCA is scale‑sensitive)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratios:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
# Plot
plt.figure(figsize=(9,7))
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.xlabel('PC1 ({:.1f}%)'.format(100*pca.explained_variance_ratio_[0]))
plt.ylabel('PC2 ({:.1f}%)'.format(100*pca.explained_variance_ratio_[1]))
plt.title('Iris dataset: PCA projection')
plt.colorbar(scatter)
plt.savefig('pca_iris.png')
plt.close()
3.2 Kernel PCA
Kernel PCA applies the kernel trick to PCA, enabling non‑linear dimensionality reduction. It maps data implicitly to a high‑dimensional feature space and then performs linear PCA in that space.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.04)
X_kpca = kpca.fit_transform(X_scaled)
plt.scatter(X_kpca[:,0], X_kpca[:,1], c=y_iris, cmap='Set1', edgecolor='k')
plt.title('Kernel PCA (RBF) on Iris')
plt.savefig('kpca_iris.png')
plt.close()
3.3 t‑SNE & UMAP: Nonlinear Neighbourhood Preservation
t‑SNE (t‑Distributed Stochastic Neighbor Embedding) converts pairwise similarities to probabilities and tries to reproduce them in low dimension, using a heavy‑tailed distribution to alleviate crowding. Excellent for visualisation, but stochastic and non‑parametric (no direct map for new points). Perplexity balances local vs. global structure.
UMAP (Uniform Manifold Approximation and Projection) builds on similar principles but is faster and better preserves global structure. It is often preferred for large datasets.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42, learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(9,7))
plt.scatter(X_tsne[:,0], X_tsne[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.title('t‑SNE embedding of Iris (perplexity=30)')
plt.xlabel('t‑SNE dim 1')
plt.ylabel('t‑SNE dim 2')
plt.savefig('tsne_iris.png')
plt.close()
# UMAP (if installed)
try:
    import umap
    reducer = umap.UMAP(random_state=42)
    X_umap = reducer.fit_transform(X_scaled)
    plt.figure(figsize=(9,7))
    plt.scatter(X_umap[:,0], X_umap[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
    plt.title('UMAP embedding of Iris')
    plt.savefig('umap_iris.png')
    plt.close()
except ImportError:
    print("UMAP not installed; skipping.")
3.4 Autoencoders: Nonlinear PCA via Neural Nets
An autoencoder is a feedforward network trained to reconstruct its input through a bottleneck (latent space). With linear activations, it learns the PCA subspace; with nonlinearities, it captures complex manifolds. Denoising and variational autoencoders regularise the latent space.
Architecture: an encoder compresses the input x into a latent code z, a decoder reconstructs x̂ from z, and training minimises the reconstruction loss ||x − x̂||².
# Simple autoencoder with TensorFlow/Keras (if installed)
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
input_dim = X_scaled.shape[1]
encoding_dim = 2 # bottleneck size
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='linear')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# Train briefly (real use would need more epochs)
history = autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=16,
                          shuffle=True, verbose=0, validation_split=0.2)
# Extract encoder
encoder = Model(input_layer, encoded)
X_latent = encoder.predict(X_scaled)
plt.figure(figsize=(9,7))
plt.scatter(X_latent[:,0], X_latent[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.title('Autoencoder bottleneck (2D) – nonlinear embedding')
plt.xlabel('latent dim 1')
plt.ylabel('latent dim 2')
plt.savefig('autoencoder_iris.png')
plt.close()
print("✅ Autoencoder latent representation extracted.")
3.5 Self‑Supervised / Contrastive Learning
Modern unsupervised representation learning often uses contrastive methods (SimCLR, MoCo, BYOL). They train an encoder by pulling representations of augmented views of the same image together and pushing views of different images apart. These methods learn rich features without any labels and can be used for downstream tasks.
Below is a simplified SimCLR‑style loss (InfoNCE) implemented in TensorFlow for illustration (conceptual).
# Conceptual SimCLR loss (simplified NT-Xent / InfoNCE)
import tensorflow as tf

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # Normalise embeddings onto the unit sphere
    z_i = tf.math.l2_normalize(z_i, axis=1)
    z_j = tf.math.l2_normalize(z_j, axis=1)
    representations = tf.concat([z_i, z_j], axis=0)
    similarity_matrix = tf.matmul(representations, representations, transpose_b=True)
    batch_size = tf.shape(z_i)[0]
    # The positive for row i is row i + batch_size (and vice versa)
    labels = tf.concat([tf.range(batch_size, 2 * batch_size),
                        tf.range(batch_size)], axis=0)
    labels = tf.one_hot(labels, 2 * batch_size)
    # Mask out self-similarities so a view cannot match itself
    logits = similarity_matrix / temperature
    logits -= 1e9 * tf.eye(2 * batch_size)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_mean(loss)
4. Anomaly Detection: Finding the Needles
Anomalies (outliers) are samples that differ significantly from the majority. Unsupervised anomaly detection uses density estimation, clustering, or reconstruction error. Common algorithms: Isolation Forest, One‑Class SVM, Local Outlier Factor, Elliptic Envelope, and autoencoder‑based methods.
4.1 Isolation Forest
Isolation Forest isolates anomalies by randomly splitting features. Anomalies are few and different, so they require fewer splits to isolate. Average path length over trees gives an anomaly score. The algorithm is efficient and works well for high‑dimensional data.
from sklearn.ensemble import IsolationForest
# Normal data (two blobs) + outliers
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(200, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2] # two normal clusters
X_outliers = np.random.uniform(low=-4, high=4, size=(40, 2))
X_total = np.r_[X_inliers, X_outliers]
# Fit Isolation Forest
iforest = IsolationForest(contamination=0.1, random_state=42)
y_pred_if = iforest.fit_predict(X_total) # -1 = anomaly, 1 = normal
plt.figure(figsize=(9,7))
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_if, cmap='coolwarm', edgecolor='k', s=60)
plt.title('Isolation Forest: anomalies in red (contamination=0.1)')
plt.savefig('iforest.png')
plt.close()
4.2 One‑Class SVM
One‑Class SVM learns a boundary that encloses most of the data; points outside are anomalies. It works well with non‑linear kernels (RBF). The parameter nu upper‑bounds the fraction of training points treated as outliers (and lower‑bounds the fraction of support vectors), so it plays the role of an expected contamination rate.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05) # nu ~ expected contamination
y_pred_svm = ocsvm.fit_predict(X_total)
plt.figure(figsize=(9,7))
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_svm, cmap='bwr', edgecolor='k', s=60)
plt.title('One‑Class SVM (RBF) – anomaly detection')
plt.savefig('ocsvm.png')
plt.close()
4.3 Local Outlier Factor (LOF)
LOF measures the local density deviation of a point compared to its neighbours. Points with substantially lower density than neighbours are considered outliers.
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred_lof = lof.fit_predict(X_total)
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_lof, cmap='coolwarm', edgecolor='k')
plt.title('LOF anomaly detection')
plt.savefig('lof.png')
plt.close()
4.4 Elliptic Envelope
Assumes data is Gaussian and fits a robust covariance estimate (Minimum Covariance Determinant). Points with high Mahalanobis distance are flagged.
from sklearn.covariance import EllipticEnvelope
ee = EllipticEnvelope(contamination=0.1, random_state=42)
y_pred_ee = ee.fit_predict(X_total)
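The robust squared Mahalanobis distances behind this decision can be inspected directly; a self-contained sketch on synthetic data (one tight blob plus scattered outliers, mirroring the earlier setup):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
X_in = 0.3 * rng.randn(200, 2) + 2                  # tight Gaussian blob
X_out = rng.uniform(low=-4, high=4, size=(20, 2))   # scattered outliers
X = np.r_[X_in, X_out]

ee = EllipticEnvelope(contamination=0.1, random_state=42).fit(X)
d2 = ee.mahalanobis(X)  # squared Mahalanobis distance to the robust centre

# Outliers should, on average, lie much further from the robust centre
print("mean d^2 inliers :", round(d2[:200].mean(), 2))
print("mean d^2 outliers:", round(d2[200:].mean(), 2))
```

Because the covariance is estimated robustly (Minimum Covariance Determinant), the outliers themselves barely inflate the ellipse they are measured against.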
4.5 Autoencoder‑based Anomaly Detection
Train an autoencoder on normal data only; anomalies yield high reconstruction error. Use reconstruction error threshold to flag anomalies.
# Assume `autoencoder` was (re)trained on the normal data X_inliers only
# (it must match the dimensionality of X_total)
reconstructions = autoencoder.predict(X_total)
mse = np.mean(np.square(X_total - reconstructions), axis=1)
threshold = np.percentile(mse, 95) # or use contamination
anomaly_pred = (mse > threshold).astype(int) # 1 = anomaly
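The snippet above depends on an autoencoder trained elsewhere; here is a fully self-contained variant of the same recipe using PCA reconstruction error as a linear stand-in for an autoencoder (a linear autoencoder learns the same subspace):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Inliers live near a 1-D line in 2-D; outliers are scattered uniformly
t = rng.randn(300)
X_in = np.c_[t, 2 * t] + 0.1 * rng.randn(300, 2)
X_out = rng.uniform(-4, 4, size=(30, 2))
X_all = np.r_[X_in, X_out]

# "Train" the linear autoencoder on normal data only
pca = PCA(n_components=1).fit(X_in)
recon = pca.inverse_transform(pca.transform(X_all))
mse = np.mean((X_all - recon) ** 2, axis=1)

threshold = np.percentile(mse[:300], 99)  # threshold set from normal data
anomaly = mse > threshold
print("flagged outliers:", int(anomaly[300:].sum()), "of 30")
```

Points far from the learned manifold reconstruct poorly and exceed the threshold; inliers, by construction, almost never do.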
5. Evaluating the Unsupervised
Without ground truth, we use internal validation metrics. For clustering: silhouette score, Davies–Bouldin index (lower is better), Calinski–Harabasz index (higher is better). For dimensionality reduction: reconstruction error, trustworthiness, or downstream performance. For anomaly detection: if some labelled anomalies exist, precision/recall can be used; otherwise, we rely on domain inspection.
Silhouette Analysis
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
ks = range(2,7)
sil_scores = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))
plt.plot(ks, sil_scores, 'bo-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette analysis for Iris (higher is better)')
plt.savefig('silhouette.png')
plt.close()
print(f"Optimal k by silhouette: {ks[np.argmax(sil_scores)]}")
Davies‑Bouldin Index
Measures average similarity between each cluster and its most similar one. Lower values indicate better separation.
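For concreteness, a quick sweep over k on the scaled Iris data, reporting both indices (Davies–Bouldin lower is better, Calinski–Harabasz higher is better):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

X = StandardScaler().fit_transform(load_iris().data)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    db = davies_bouldin_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    print(f"k={k}: Davies-Bouldin={db:.3f}, Calinski-Harabasz={ch:.1f}")
```

Running several internal metrics side by side guards against over-trusting any single one, since they can disagree about the best k.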
5.1 External validation (if labels exist)
When ground truth is available for evaluation (but not used during training), we can use adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity, completeness, V‑measure, Fowlkes‑Mallows.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
# y_true: held-out ground truth; labels_pred: cluster assignments from any algorithm above
ari = adjusted_rand_score(y_true, labels_pred)
nmi = normalized_mutual_info_score(y_true, labels_pred)
print(f'ARI: {ari:.3f}, NMI: {nmi:.3f}')
6. Real‑World Case Studies
6.1 Customer Segmentation (Wholesale Data) – Extended
We apply unsupervised learning to a wholesale customer dataset (UCI “Wholesale customers”). Features: spending on fresh, milk, grocery, frozen, detergents, delicatessen. Goal: segment customers for targeted marketing. We preprocess (log transform to reduce skew), scale, reduce with PCA, cluster with k‑means, and interpret segments.
Steps:
- Load and explore data.
- Apply log transformation to handle right‑skewed spend data.
- Standardise features (zero mean, unit variance).
- Use PCA to visualise in 2D.
- Determine number of clusters via elbow method.
- Run k‑means with chosen k.
- Analyse cluster profiles (mean spending per category) to label segments.
# Simulate wholesale dataset (fallback synthetic if not available)
import pandas as pd
try:
    from sklearn.datasets import fetch_openml
    wholesale = fetch_openml(data_id=236, as_frame=True)  # Wholesale customers
    df = wholesale.frame
except Exception:
    # Synthetic replacement (gamma distributed to mimic spend data)
    np.random.seed(42)
    data = np.random.gamma(shape=2, scale=200, size=(500, 6))
    df = pd.DataFrame(data, columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
                                     'Detergents_Paper', 'Delicassen'])
# Log transform (to handle skewness)
df_log = np.log1p(df)
# Standardise
scaler = StandardScaler()
X_wholesale = scaler.fit_transform(df_log)
# PCA for 2D visualisation
pca_wh = PCA(n_components=2)
X_pca_wh = pca_wh.fit_transform(X_wholesale)
# Elbow method to guess k
inertias = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_wholesale)
    inertias.append(km.inertia_)
plt.plot(range(2,9), inertias, 'o-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow for wholesale data')
plt.savefig('wholesale_elbow.png')
plt.close()
# Choose k=4 (by elbow) and segment
kmeans_final = KMeans(n_clusters=4, random_state=42, n_init=10)
segments = kmeans_final.fit_predict(X_wholesale)
# Analyse segment means in original scale
df['Segment'] = segments
print(df.groupby('Segment').mean())
# Visualise segments in PCA space
plt.figure(figsize=(10,8))
plt.scatter(X_pca_wh[:,0], X_pca_wh[:,1], c=segments, cmap='tab10', alpha=0.7, s=60)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Customer segments in PCA space')
plt.colorbar()
plt.savefig('wholesale_segments.png')
plt.close()
Interpretation: Segment 0 might be "restaurants" (high fresh & frozen), Segment 1 "retail" (high grocery & detergents), etc. Such profiles inform marketing strategies.
6.2 MNIST Image Clustering
Clustering handwritten digits without labels – we use dimensionality reduction (UMAP/t‑SNE) followed by k‑means, then evaluate against true labels (for demonstration).
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
X_mnist = mnist.data[:5000] # subset for speed
y_mnist = mnist.target[:5000].astype(int)
# Preprocessing: scale to [0,1]
X_mnist = X_mnist / 255.0
# UMAP reduction (requires the optional umap-learn package)
import umap
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap_mnist = reducer.fit_transform(X_mnist)
# Cluster with k-means (k=10)
km_mnist = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters_mnist = km_mnist.fit_predict(X_umap_mnist)
# Visualise
plt.scatter(X_umap_mnist[:,0], X_umap_mnist[:,1], c=clusters_mnist, cmap='tab10', s=1, alpha=0.7)
plt.title('MNIST clusters after UMAP')
plt.savefig('mnist_clusters.png')
plt.close()
# Evaluate ARI
print('ARI vs true labels:', adjusted_rand_score(y_mnist, clusters_mnist))
6.3 Credit Card Fraud Detection (Anomaly)
Use Isolation Forest on a subset of credit card data (fraud is rare).
# Dummy example; real dataset would be from Kaggle
# X_credit = ... (features), y_credit = labels (0=normal,1=fraud)
# Train Isolation Forest (contamination=0.01)
# Evaluate precision/recall on test set
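The real Kaggle dataset is not bundled here, so the sketch below uses synthetic imbalanced data to show the same workflow; all feature values and rates are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(42)
# Synthetic stand-in: 2000 normal transactions, 20 "frauds" far from the bulk
X_normal = rng.randn(2000, 4)
X_fraud = rng.randn(20, 4) * 0.5 + 6.0
X = np.r_[X_normal, X_fraud]
y = np.r_[np.zeros(2000), np.ones(20)]  # 1 = fraud

iforest = IsolationForest(contamination=0.01, random_state=42).fit(X)
pred = (iforest.predict(X) == -1).astype(int)  # -1 means anomaly

print("precision:", round(precision_score(y, pred), 3))
print("recall   :", round(recall_score(y, pred), 3))
```

With real fraud data the anomalies are far less separable than this toy setup, so precision/recall trade-offs (and the contamination setting) need careful tuning on a labelled validation slice.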
7. Challenges and Advanced Directions
Unsupervised learning is subtle: feature scaling, distance choice, hyperparameters, and initialisation all affect results. Validation must be grounded in domain knowledge. Modern directions include:
- Self‑supervised learning (contrastive methods like SimCLR, BYOL) that learn representations from unlabelled data by solving pretext tasks (e.g., predicting rotation, contrasting augmented views).
- Deep clustering (e.g., DeepCluster, SwAV) that jointly learns features and cluster assignments.
- Graph‑based clustering (spectral clustering, community detection) for network data.
- Variational autoencoders (VAEs) and generative adversarial networks (GANs) for learning latent distributions and generating new samples.
- Dimensionality reduction for streaming data (incremental PCA, t‑SNE with landmarks).
7.1 High‑Dimensional Data
In high dimensions, distances become less meaningful (curse of dimensionality). Use dimensionality reduction first, or employ cosine similarity, or subspace clustering methods (e.g., CLIQUE, SUBCLU).
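One practical consequence of the paragraph above: L2-normalising rows makes Euclidean k-means act like cosine-similarity clustering, since for unit vectors ||u − v||² = 2(1 − cos(u, v)). A sketch with synthetic direction-clustered data:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# High-dimensional points whose *direction*, not magnitude, carries the signal
A = rng.randn(100, 50) + 3 * rng.randn(1, 50)   # cluster around direction a
B = rng.randn(100, 50) + 3 * rng.randn(1, 50)   # cluster around direction b
X = np.vstack([A * rng.uniform(1, 10, (100, 1)),  # random magnitudes
               B * rng.uniform(1, 10, (100, 1))])

X_unit = normalize(X)  # project every row onto the unit sphere
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_unit)
purity = max(labels[:100].mean(), 1 - labels[:100].mean())
print("first-group purity:", round(float(purity), 3))
```

Without the normalisation step, the random magnitudes dominate Euclidean distance and the directional clusters are lost.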
7.2 Non‑Vectorial Data (Graphs, Sequences)
For graph data, community detection (Louvain, Leiden) and node embeddings (node2vec, GraphSAGE) are common. For time series, cluster using dynamic time warping (DTW) distances or extracted features.
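A minimal DTW distance in NumPy illustrates the idea (quadratic time; in practice a library such as tslearn or dtaidistance would be used):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sine and a time-shifted sine align well under DTW; noise does not
t = np.linspace(0, 2 * np.pi, 50)
a, b = np.sin(t), np.sin(t + 0.5)
noise = np.random.RandomState(0).randn(50)
print("DTW(sin, shifted sin):", round(dtw_distance(a, b), 3))
print("DTW(sin, noise)      :", round(dtw_distance(a, noise), 3))
```

Feeding a precomputed DTW distance matrix into hierarchical clustering (or k-medoids) then clusters series by shape rather than by pointwise alignment.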
7.3 Deep Clustering
Deep embedded clustering (DEC) learns feature representations and cluster assignments simultaneously. The loss function includes a clustering loss (KL divergence) plus reconstruction.
# Pseudo‑code for DEC:
# 1. Pre‑train autoencoder.
# 2. Initialise cluster centres with k‑means on latent space.
# 3. Fine‑tune with KL divergence between soft assignments and auxiliary target distribution.
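Step 3's soft assignments and auxiliary target distribution can be sketched in NumPy (Student-t kernel with alpha = 1, as in the DEC paper; the toy latent points and centres are illustrative):

```python
import numpy as np

def soft_assign(Z, centres, alpha=1.0):
    """Student-t soft assignment q_ij of latent points to cluster centres."""
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target p_ij = (q_ij^2 / f_j) / normaliser."""
    f = q.sum(axis=0)            # soft cluster frequencies
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.RandomState(42)
Z = rng.randn(6, 2)              # toy latent points
centres = np.array([[0.0, 0.0], [3.0, 3.0]])
q = soft_assign(Z, centres)
p = target_distribution(q)
print(np.round(q, 3))
print(np.round(p, 3))   # p is generally sharper (more confident) than q
```

Fine-tuning then minimises KL(p || q), pulling the encoder's soft assignments towards this sharpened target.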
8. Production & Scalability Considerations
Deploying unsupervised models requires attention to:
- Scalability: K‑means and PCA scale roughly linearly (via mini‑batch K‑means, incremental PCA). DBSCAN is O(n²) in the worst case but can be accelerated with spatial indexes (k‑d trees, ball trees). t‑SNE is slow for >100k points; UMAP or parametric t‑SNE are alternatives.
- Model persistence: Save trained models (centroids, PCA components, autoencoder weights) for inference.
- Online learning: For streaming data, use mini‑batch K‑means, online GMM, or autoencoders with incremental training.
- Interpretability: SHAP or feature contributions can explain cluster assignments.
- Monitoring: Track cluster sizes, reconstruction error, or anomaly scores over time to detect drift.
# Example: mini‑batch K‑means for large datasets
from sklearn.cluster import MiniBatchKMeans
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42)
mbk_labels = mbk.fit_predict(X) # X could be millions of rows
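Model persistence from the list above is typically a one-liner with joblib; a minimal round-trip sketch (file path illustrative):

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(42).randn(500, 3)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42, n_init=3).fit(X)

# Persist the fitted model (centroids + metadata) and load it back
path = os.path.join(tempfile.gettempdir(), 'mbk_model.joblib')
joblib.dump(mbk, path)
mbk_loaded = joblib.load(path)

# The reloaded model must reproduce the original assignments exactly
assert np.array_equal(mbk.predict(X), mbk_loaded.predict(X))
print("model round-trips correctly")
```

In production the saved artefact should be versioned alongside the preprocessing objects (scaler, PCA), since predictions are only valid with the exact transformation the model was trained on.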
9. Fairness, Ethics & Interpretability
Unsupervised models can inadvertently encode biases present in data. For example, customer segmentation may lead to discriminatory pricing if clusters correlate with protected attributes. It's crucial to audit clusters for fairness and ensure interpretability. Tools like SHAP can explain why a point is assigned to a particular cluster.
10. Conclusion & Further Reading
We have explored the core pillars of unsupervised learning: clustering (k‑means, DBSCAN, hierarchical, GMM, spectral, etc.), dimensionality reduction (PCA, t‑SNE, UMAP, autoencoders), and anomaly detection (Isolation Forest, One‑Class SVM, LOF). These tools let us extract value from unlabelled data, detect fraud, segment customers, compress features, and drive scientific discovery. In production, pipelines must be carefully validated and monitored — but unsupervised learning remains an indispensable part of the machine learning practitioner’s arsenal.
Recommended resources:
- "Pattern Recognition and Machine Learning" by Christopher Bishop (chapters on mixture models, EM).
- "The Elements of Statistical Learning" by Hastie et al. (PCA, clustering).
- scikit‑learn documentation: Unsupervised learning.
- UMAP documentation: https://umap-learn.readthedocs.io/.
- Deep learning book by Goodfellow et al. (autoencoders, generative models).
📌 Complete runnable code
All code blocks are valid Python and can be executed sequentially in a Jupyter notebook or script. Required libraries: numpy, matplotlib, scikit-learn, scipy, pandas; optional: tensorflow (autoencoder sections) and umap-learn (UMAP sections). Install via pip install numpy matplotlib scikit-learn scipy pandas tensorflow umap-learn. Each segment produces diagnostic plots and prints; random seeds are fixed for reproducibility.