Unsupervised Learning:
Finding Hidden Patterns in Chaos
📖 Contents
- 1. The Unsupervised Paradigm
- 2. Clustering Foundations
- 2.1 K‑Means & Inertia
- 2.2 DBSCAN
- 2.3 OPTICS
- 2.4 Hierarchical Clustering
- 2.5 Gaussian Mixture Models
- 2.6 Spectral Clustering
- 2.7 Affinity Propagation, BIRCH, Mean Shift
- 3. Dimensionality Reduction
- 3.1 Principal Component Analysis
- 3.2 Kernel PCA
- 3.3 t‑SNE & UMAP
- 3.4 Autoencoders
- 3.5 Self‑Supervised / Contrastive Learning
- 4. Anomaly Detection
- 4.1 Isolation Forest
- 4.2 One‑Class SVM
- 4.3 Local Outlier Factor (LOF)
- 4.4 Elliptic Envelope
- 4.5 Autoencoder‑based Anomaly Detection
- 5. Evaluation Without Labels
- 5.1 External validation (if labels exist)
- 6. Real‑World Case Studies
- 6.1 Customer Segmentation (wholesale)
- 6.2 MNIST Image Clustering
- 6.3 Credit Card Fraud Detection
- 7. Challenges & Advanced Topics
- 7.1 High‑Dimensional Data
- 7.2 Non‑Vectorial Data (Graphs, Sequences)
- 7.3 Deep Clustering
- 8. Production & Scalability
- 9. Fairness, Ethics & Interpretability
- 10. Conclusion & Further Reading
1. The Unsupervised Paradigm
In supervised learning, every example comes with a target label; in unsupervised learning we observe only the inputs and must uncover structure (clusters, manifolds, densities, anomalies) on our own. Key applications include:
- Scientific discovery: Clustering genes with similar expression patterns reveals unknown biological pathways.
- Customer analytics: Segmentation enables personalised marketing without prior labels.
- Anomaly detection: Identify fraudulent transactions, defective parts, or network intrusions.
- Feature learning: Autoencoders and self‑supervised methods produce rich representations for downstream tasks.
- Data compression & visualisation: PCA, t‑SNE, UMAP allow us to see high‑dimensional data in 2D/3D.
- Generative modelling: Variational autoencoders and GANs learn the underlying data distribution.
- Recommender systems: Matrix factorisation (SVD, NMF) uncovers latent user/item factors.
The unsupervised learning pipeline typically involves preprocessing (scaling, handling missing values), choosing an algorithm with appropriate hyperparameters, running the algorithm, and then interpreting or evaluating the results. Since no ground truth exists, domain knowledge and internal validation metrics are crucial.
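A minimal sketch of that pipeline with scikit-learn's Pipeline (the synthetic data and choice of k are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data standing in for a real, unlabelled dataset
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Preprocessing and clustering chained into one estimator
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('cluster', KMeans(n_clusters=3, random_state=42, n_init=10)),
])
labels = pipe.fit_predict(X)
print("Cluster sizes:", np.bincount(labels))
```

Chaining the scaler and the clusterer keeps preprocessing and model fitting in one object, which matters later when the same transformation must be replayed at inference time.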
2. Clustering: Organising Chaos into Groups
Clustering algorithms partition data into groups (clusters) such that points within a cluster are more similar to each other than to points in other groups. We cover four fundamental families: centroid‑based (k‑means), density‑based (DBSCAN, OPTICS), hierarchical, and probabilistic (Gaussian Mixture Models). Then we extend to spectral, affinity propagation, BIRCH, and mean shift.
2.1 K‑Means: The Workhorse of Partitioning
K‑Means aims to partition n observations into k clusters that minimise the within‑cluster sum of squared distances (the inertia): J = Σ_j Σ_{x∈C_j} ||x − μ_j||², where μ_j is the centroid of cluster C_j.
The algorithm alternates between (1) assigning each point to the nearest centroid, and (2) updating each centroid to the mean of its assigned points. Convergence to a local optimum is guaranteed. It is sensitive to initialisation (mitigated by k‑means++) and assumes roughly spherical clusters of similar size. Scaling features is mandatory.
Mathematical insight: minimising within‑cluster variance is equivalent to maximising between‑cluster variance (Huygens' theorem). The algorithm can be viewed as hard‑assignment EM for a Gaussian mixture with identical spherical covariances.
# =================================================
# K‑Means from scratch + scikit‑learn comparison
# =================================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate synthetic data: 5 clusters
X, y_true = make_blobs(n_samples=400, centers=5,
                       cluster_std=0.70, random_state=42)
# ---- custom implementation ----
def kmeans_custom(X, k, max_iters=100, tol=1e-4):
    # Random initial centroids (better: k-means++ in production)
    np.random.seed(42)
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx]
    for _ in range(max_iters):
        # distance of every point to every centroid
        dists = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        labels = np.argmin(dists, axis=1)
        # new centroid = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
labels_custom, cents_custom = kmeans_custom(X, 5)
# ---- scikit‑learn ----
kmeans_sk = KMeans(n_clusters=5, random_state=42, n_init=10)
labels_sk = kmeans_sk.fit_predict(X)
# ---- visual comparison ----
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X[:,0], X[:,1], c=labels_custom, cmap='viridis', edgecolor='k', alpha=0.7)
axes[0].scatter(cents_custom[:,0], cents_custom[:,1], c='red', marker='X', s=200, label='centroids')
axes[0].set_title('Custom K‑Means')
axes[1].scatter(X[:,0], X[:,1], c=labels_sk, cmap='viridis', edgecolor='k', alpha=0.7)
axes[1].scatter(kmeans_sk.cluster_centers_[:,0], kmeans_sk.cluster_centers_[:,1], c='red', marker='X', s=200)
axes[1].set_title('scikit‑learn K‑Means')
plt.suptitle('K‑Means clustering – identical results (apart from init)')
plt.tight_layout()
plt.savefig('kmeans_demo.png', dpi=100)
plt.close()
print("✅ K‑Means comparison executed. Both produce similar clusters.")
The inertia decreases monotonically as k grows (reaching zero when k = n), so it cannot select k on its own; the elbow method instead picks the k at which the marginal decrease in inertia flattens out.
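A minimal elbow-method sketch on the same kind of synthetic blobs (the range of candidate k is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.70, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks with k; the "elbow" is where the drop flattens
print("Inertias:", np.round(inertias, 1))
```

On clean blobs the drop from k=4 to k=5 is large and the curve flattens afterwards, which is the signal the heuristic looks for.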
2.2 DBSCAN: Density‑Based Clustering
DBSCAN (Density‑Based Spatial Clustering of Applications with Noise) defines clusters as contiguous regions of high density separated by low‑density areas. It has two parameters: eps, the neighbourhood radius, and minPts (min_samples in scikit‑learn), the minimum number of neighbours a point needs to count as dense.
Algorithm outline:
- For each point, find the points in its eps‑neighbourhood.
- If a point has at least minPts neighbours, it is a core point.
- Expand clusters by recursively adding density‑connected points.
- Points not reachable from any core point are labelled noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Non‑spherical data: two interleaving moons
X_moon, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
# DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels_db = db.fit_predict(X_moon)
# K‑Means on same data (fails)
kmeans_moon = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km_moon = kmeans_moon.fit_predict(X_moon)
# Plot
fig, ax = plt.subplots(1,2,figsize=(14,5))
ax[0].scatter(X_moon[:,0], X_moon[:,1], c=labels_db, cmap='cool', edgecolor='k')
ax[0].set_title('DBSCAN captures moon structure')
ax[1].scatter(X_moon[:,0], X_moon[:,1], c=labels_km_moon, cmap='cool', edgecolor='k')
ax[1].set_title('K‑Means forces convex boundaries')
plt.tight_layout()
plt.savefig('dbscan_vs_kmeans.png')
plt.close()
print("🔵 DBSCAN succeeds, k‑means fails on non‑spherical data.")
2.3 OPTICS: Extending DBSCAN for Varying Density
OPTICS (Ordering Points To Identify the Clustering Structure) generalises DBSCAN by removing the need for a single eps value: it orders points by reachability distance, exposing clustering structure across a range of density scales, and clusters are then extracted from the reachability plot (e.g. via the xi method).
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.1)
labels_optics = optics.fit_predict(X_moon)
plt.scatter(X_moon[:,0], X_moon[:,1], c=labels_optics, cmap='cool', edgecolor='k')
plt.title('OPTICS clustering (automatically finds hierarchy)')
plt.savefig('optics.png')
plt.close()
2.4 Hierarchical Clustering
Agglomerative clustering builds a hierarchy (dendrogram) by repeatedly merging the closest pair of clusters. Linkage criteria: single (minimum distance), complete (maximum), average, and Ward (minimises variance increase). Cutting the dendrogram at a height yields a flat partition.
Ward’s method: at each step, merge the pair of clusters that minimises the increase in total within‑cluster variance. It optimises the same sum‑of‑squares criterion as k‑means, but greedily and hierarchically.
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_classification
X_hier, _ = make_classification(n_samples=40, n_features=2, n_redundant=0,
n_clusters_per_class=1, n_classes=3, random_state=4)
# Ward linkage
Z = linkage(X_hier, method='ward')
# Plot dendrogram
plt.figure(figsize=(12,6))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Sample index (cluster size)')
plt.ylabel('Ward distance')
plt.tight_layout()
plt.savefig('dendrogram.png')
plt.close()
# Form 3 clusters by cutting at distance 7
labels_hier = fcluster(Z, t=7, criterion='distance')
print(f"Hierarchical cluster labels: {np.unique(labels_hier)}")
2.5 Gaussian Mixture Models (GMM) – Soft Clustering
GMM assumes data is generated from a mixture of K Gaussian components, each with its own mean and covariance; the EM algorithm fits the parameters and yields soft assignments, i.e. a posterior probability that each point belongs to each component.
The Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) can be used to select K: fit models over a range of component counts and keep the one with the lowest criterion value.
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# Fit GMM with 3 components
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)
# Compare with true labels (for illustration only)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
fig, axes = plt.subplots(1,2,figsize=(12,5))
axes[0].scatter(X_pca[:,0], X_pca[:,1], c=y_iris, cmap='Set1', edgecolor='k')
axes[0].set_title('True Iris species')
axes[1].scatter(X_pca[:,0], X_pca[:,1], c=gmm_labels, cmap='Set1', edgecolor='k')
axes[1].set_title('GMM clustering (unsupervised)')
plt.savefig('gmm_iris.png')
plt.close()
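The BIC-based choice of the number of components mentioned above can be sketched on the same scaled Iris data (the candidate range is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

X = StandardScaler().fit_transform(load_iris().data)

bics = []
for n in range(1, 7):
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))  # lower BIC is better

best_n = int(np.argmin(bics)) + 1
print("BIC per n_components:", np.round(bics, 1))
print("Best n by BIC:", best_n)
```

BIC penalises model complexity, so it tends to pick fewer components than the likelihood alone would suggest.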
2.6 Spectral Clustering
Spectral clustering uses the eigenvalues (spectrum) of a similarity matrix to reduce dimensionality before clustering in fewer dimensions. It is particularly good for non‑convex clusters. Steps: construct affinity matrix (e.g., k‑nearest neighbour graph), compute Laplacian, extract eigenvectors, and run k‑means on selected eigenvectors.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
# Concentric circles
X_circle, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=42)
labels_spec = sc.fit_predict(X_circle)
plt.scatter(X_circle[:,0], X_circle[:,1], c=labels_spec, cmap='cool', edgecolor='k')
plt.title('Spectral clustering on circles')
plt.savefig('spectral.png')
plt.close()
2.7 Affinity Propagation, BIRCH, Mean Shift
Affinity Propagation does not require specifying the number of clusters; it sends messages between pairs until exemplars emerge. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for large datasets by building a CF‑tree. Mean Shift shifts points towards the mode of density; it finds clusters of any shape without specifying k.
from sklearn.cluster import AffinityPropagation, Birch, MeanShift
# Example on small data
ap = AffinityPropagation(random_state=42).fit(X[:200])
birch = Birch(n_clusters=3).fit(X)
ms = MeanShift().fit(X[:200])
3. Dimensionality Reduction: Seeing the Manifold
High‑dimensional data suffer from the curse of dimensionality. Reducing dimensions to 2 or 3 aids visualisation and often improves generalisation. We cover PCA (linear), Kernel PCA, t‑SNE/UMAP, autoencoders, and modern self‑supervised methods.
3.1 Principal Component Analysis (PCA)
PCA finds orthogonal axes (principal components) that maximise variance. It is based on the eigen‑decomposition of the covariance matrix of the centred data: the eigenvectors give the component directions and the eigenvalues λ_i the variance captured along each.
The proportion of variance explained by the i‑th component is λ_i / Σ_j λ_j; the cumulative ratio guides how many components to retain.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Standardise (PCA is scale‑sensitive)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratios:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
# Plot
plt.figure(figsize=(9,7))
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.xlabel('PC1 ({:.1f}%)'.format(100*pca.explained_variance_ratio_[0]))
plt.ylabel('PC2 ({:.1f}%)'.format(100*pca.explained_variance_ratio_[1]))
plt.title('Iris dataset: PCA projection')
plt.colorbar(scatter)
plt.savefig('pca_iris.png')
plt.close()
3.2 Kernel PCA
Kernel PCA applies the kernel trick to PCA, enabling non‑linear dimensionality reduction. It maps data implicitly to a high‑dimensional feature space and then performs linear PCA in that space.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.04)
X_kpca = kpca.fit_transform(X_scaled)
plt.scatter(X_kpca[:,0], X_kpca[:,1], c=y_iris, cmap='Set1', edgecolor='k')
plt.title('Kernel PCA (RBF) on Iris')
plt.savefig('kpca_iris.png')
plt.close()
3.3 t‑SNE & UMAP: Nonlinear Neighbourhood Preservation
t‑SNE (t‑Distributed Stochastic Neighbor Embedding) converts pairwise similarities to probabilities and tries to reproduce them in low dimension, using a heavy‑tailed distribution to alleviate crowding. Excellent for visualisation, but stochastic and non‑parametric (no direct map for new points). Perplexity balances local vs. global structure.
UMAP (Uniform Manifold Approximation and Projection) builds on similar principles but is faster and better preserves global structure. It is often preferred for large datasets.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42, learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(9,7))
plt.scatter(X_tsne[:,0], X_tsne[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.title('t‑SNE embedding of Iris (perplexity=30)')
plt.xlabel('t‑SNE dim 1')
plt.ylabel('t‑SNE dim 2')
plt.savefig('tsne_iris.png')
plt.close()
# UMAP (if installed)
try:
    import umap
    reducer = umap.UMAP(random_state=42)
    X_umap = reducer.fit_transform(X_scaled)
    plt.figure(figsize=(9,7))
    plt.scatter(X_umap[:,0], X_umap[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
    plt.title('UMAP embedding of Iris')
    plt.savefig('umap_iris.png')
    plt.close()
except ImportError:
    print("UMAP not installed; skipping.")
3.4 Autoencoders: Nonlinear PCA via Neural Nets
An autoencoder is a feedforward network trained to reconstruct its input through a bottleneck (latent space). With linear activations, it learns the PCA subspace; with nonlinearities, it captures complex manifolds. Denoising and variational autoencoders regularise the latent space.
Architecture: an encoder compresses the input x into a latent code z, a decoder reconstructs x̂ from z, and training minimises the reconstruction loss ||x − x̂||².
# Simple autoencoder with TensorFlow/Keras (if installed)
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
input_dim = X_scaled.shape[1]
encoding_dim = 2 # bottleneck size
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='linear')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# Train briefly (real use would need more epochs)
history = autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=16,
                          shuffle=True, verbose=0, validation_split=0.2)
# Extract encoder
encoder = Model(input_layer, encoded)
X_latent = encoder.predict(X_scaled)
plt.figure(figsize=(9,7))
plt.scatter(X_latent[:,0], X_latent[:,1], c=y_iris, cmap='Set1', edgecolor='k', s=60)
plt.title('Autoencoder bottleneck (2D) – nonlinear embedding')
plt.xlabel('latent dim 1')
plt.ylabel('latent dim 2')
plt.savefig('autoencoder_iris.png')
plt.close()
print("✅ Autoencoder latent representation extracted.")
3.5 Self‑Supervised / Contrastive Learning
Modern unsupervised representation learning often uses contrastive methods (SimCLR, MoCo, BYOL). They train an encoder by pulling representations of augmented views of the same image together and pushing views of different images apart. These methods learn rich features without any labels and can be used for downstream tasks.
Below is a simplified SimCLR‑style loss (InfoNCE) implemented in TensorFlow for illustration (conceptual).
# Conceptual SimCLR loss (simplified NT-Xent / InfoNCE)
import tensorflow as tf

def nt_xent_loss(z_i, z_j, temperature=0.5):
    # Normalise embeddings onto the unit sphere
    z_i = tf.math.l2_normalize(z_i, axis=1)
    z_j = tf.math.l2_normalize(z_j, axis=1)
    representations = tf.concat([z_i, z_j], axis=0)
    similarity_matrix = tf.matmul(representations, representations, transpose_b=True)
    batch_size = tf.shape(z_i)[0]
    # The positive for row i is row i + batch_size (and vice versa)
    labels = tf.concat([tf.range(batch_size, 2 * batch_size),
                        tf.range(batch_size)], axis=0)
    labels = tf.one_hot(labels, 2 * batch_size)
    # Mask out self-similarities so a view cannot match itself
    logits = similarity_matrix / temperature
    logits -= 1e9 * tf.eye(2 * batch_size)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_mean(loss)
4. Anomaly Detection: Finding the Needles
Anomalies (outliers) are samples that differ significantly from the majority. Unsupervised anomaly detection uses density estimation, clustering, or reconstruction error. Common algorithms: Isolation Forest, One‑Class SVM, Local Outlier Factor, Elliptic Envelope, and autoencoder‑based methods.
4.1 Isolation Forest
Isolation Forest isolates anomalies by randomly splitting features. Anomalies are few and different, so they require fewer splits to isolate. Average path length over trees gives an anomaly score. The algorithm is efficient and works well for high‑dimensional data.
from sklearn.ensemble import IsolationForest
# Normal data (two blobs) + outliers
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(200, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2] # two normal clusters
X_outliers = np.random.uniform(low=-4, high=4, size=(40, 2))
X_total = np.r_[X_inliers, X_outliers]
# Fit Isolation Forest
iforest = IsolationForest(contamination=0.1, random_state=42)
y_pred_if = iforest.fit_predict(X_total) # -1 = anomaly, 1 = normal
plt.figure(figsize=(9,7))
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_if, cmap='coolwarm', edgecolor='k', s=60)
plt.title('Isolation Forest: anomalies in red (contamination=0.1)')
plt.savefig('iforest.png')
plt.close()
4.2 One‑Class SVM
One‑Class SVM learns a boundary that encloses most of the data; points outside are anomalies. It works well with non‑linear kernels (RBF). The parameter nu upper‑bounds the fraction of training points treated as outliers (and lower‑bounds the fraction of support vectors), so it plays the role of an expected contamination rate.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05) # nu ~ expected contamination
y_pred_svm = ocsvm.fit_predict(X_total)
plt.figure(figsize=(9,7))
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_svm, cmap='bwr', edgecolor='k', s=60)
plt.title('One‑Class SVM (RBF) – anomaly detection')
plt.savefig('ocsvm.png')
plt.close()
4.3 Local Outlier Factor (LOF)
LOF measures the local density deviation of a point compared to its neighbours. Points with substantially lower density than neighbours are considered outliers.
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred_lof = lof.fit_predict(X_total)
plt.scatter(X_total[:,0], X_total[:,1], c=y_pred_lof, cmap='coolwarm', edgecolor='k')
plt.title('LOF anomaly detection')
plt.savefig('lof.png')
plt.close()
4.4 Elliptic Envelope
Assumes data is Gaussian and fits a robust covariance estimate (Minimum Covariance Determinant). Points with high Mahalanobis distance are flagged.
from sklearn.covariance import EllipticEnvelope
ee = EllipticEnvelope(contamination=0.1, random_state=42)
y_pred_ee = ee.fit_predict(X_total)
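The robust squared Mahalanobis distances behind this decision can be inspected directly; a self-contained sketch on synthetic data (one tight blob plus scattered outliers, mirroring the earlier setup):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
X_in = 0.3 * rng.randn(200, 2) + 2                  # tight Gaussian blob
X_out = rng.uniform(low=-4, high=4, size=(20, 2))   # scattered outliers
X = np.r_[X_in, X_out]

ee = EllipticEnvelope(contamination=0.1, random_state=42).fit(X)
d2 = ee.mahalanobis(X)  # squared Mahalanobis distance to the robust centre

# Outliers should, on average, lie much further from the robust centre
print("mean d^2 inliers :", round(d2[:200].mean(), 2))
print("mean d^2 outliers:", round(d2[200:].mean(), 2))
```

Because the covariance is estimated robustly (Minimum Covariance Determinant), the outliers themselves barely inflate the ellipse they are measured against.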
4.5 Autoencoder‑based Anomaly Detection
Train an autoencoder on normal data only; anomalies yield high reconstruction error. Use reconstruction error threshold to flag anomalies.
# Assume `autoencoder` was (re)trained on the normal data X_inliers only
# (it must match the dimensionality of X_total)
reconstructions = autoencoder.predict(X_total)
mse = np.mean(np.square(X_total - reconstructions), axis=1)
threshold = np.percentile(mse, 95) # or use contamination
anomaly_pred = (mse > threshold).astype(int) # 1 = anomaly
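The snippet above depends on an autoencoder trained elsewhere; here is a fully self-contained variant of the same recipe using PCA reconstruction error as a linear stand-in for an autoencoder (a linear autoencoder learns the same subspace):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Inliers live near a 1-D line in 2-D; outliers are scattered uniformly
t = rng.randn(300)
X_in = np.c_[t, 2 * t] + 0.1 * rng.randn(300, 2)
X_out = rng.uniform(-4, 4, size=(30, 2))
X_all = np.r_[X_in, X_out]

# "Train" the linear autoencoder on normal data only
pca = PCA(n_components=1).fit(X_in)
recon = pca.inverse_transform(pca.transform(X_all))
mse = np.mean((X_all - recon) ** 2, axis=1)

threshold = np.percentile(mse[:300], 99)  # threshold set from normal data
anomaly = mse > threshold
print("flagged outliers:", int(anomaly[300:].sum()), "of 30")
```

Points far from the learned manifold reconstruct poorly and exceed the threshold; inliers, by construction, almost never do.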
5. Evaluating the Unsupervised
Without ground truth, we use internal validation metrics. For clustering: silhouette score, Davies–Bouldin index (lower is better), Calinski–Harabasz index (higher is better). For dimensionality reduction: reconstruction error, trustworthiness, or downstream performance. For anomaly detection: if some labelled anomalies exist, precision/recall can be used; otherwise, we rely on domain inspection.
Silhouette Analysis
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
ks = range(2,7)
sil_scores = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))
plt.plot(ks, sil_scores, 'bo-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette analysis for Iris (higher is better)')
plt.savefig('silhouette.png')
plt.close()
print(f"Optimal k by silhouette: {ks[np.argmax(sil_scores)]}")
Davies‑Bouldin Index
Measures average similarity between each cluster and its most similar one. Lower values indicate better separation.
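For concreteness, a quick sweep over k on the scaled Iris data, reporting both indices (Davies–Bouldin lower is better, Calinski–Harabasz higher is better):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

X = StandardScaler().fit_transform(load_iris().data)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    db = davies_bouldin_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    print(f"k={k}: Davies-Bouldin={db:.3f}, Calinski-Harabasz={ch:.1f}")
```

Running several internal metrics side by side guards against over-trusting any single one, since they can disagree about the best k.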
5.1 External validation (if labels exist)
When ground truth is available for evaluation (but not used during training), we can use adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity, completeness, V‑measure, Fowlkes‑Mallows.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
# y_true: held-out ground truth; labels_pred: cluster assignments from any algorithm above
ari = adjusted_rand_score(y_true, labels_pred)
nmi = normalized_mutual_info_score(y_true, labels_pred)
print(f'ARI: {ari:.3f}, NMI: {nmi:.3f}')
6. Real‑World Case Studies
6.1 Customer Segmentation (Wholesale Data) – Extended
We apply unsupervised learning to a wholesale customer dataset (UCI “Wholesale customers”). Features: spending on fresh, milk, grocery, frozen, detergents, delicatessen. Goal: segment customers for targeted marketing. We preprocess (log transform to reduce skew), scale, reduce with PCA, cluster with k‑means, and interpret segments.
Steps:
- Load and explore data.
- Apply log transformation to handle right‑skewed spend data.
- Standardise features (zero mean, unit variance).
- Use PCA to visualise in 2D.
- Determine number of clusters via elbow method.
- Run k‑means with chosen k.
- Analyse cluster profiles (mean spending per category) to label segments.
# Simulate wholesale dataset (fallback synthetic if not available)
import pandas as pd
try:
    from sklearn.datasets import fetch_openml
    wholesale = fetch_openml(data_id=236, as_frame=True)  # Wholesale customers
    df = wholesale.frame
except Exception:
    # Synthetic replacement (gamma distributed to mimic spend data)
    np.random.seed(42)
    data = np.random.gamma(shape=2, scale=200, size=(500, 6))
    df = pd.DataFrame(data, columns=['Fresh', 'Milk', 'Grocery', 'Frozen',
                                     'Detergents_Paper', 'Delicassen'])
# Log transform (to handle skewness)
df_log = np.log1p(df)
# Standardise
scaler = StandardScaler()
X_wholesale = scaler.fit_transform(df_log)
# PCA for 2D visualisation
pca_wh = PCA(n_components=2)
X_pca_wh = pca_wh.fit_transform(X_wholesale)
# Elbow method to guess k
inertias = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_wholesale)
    inertias.append(km.inertia_)
plt.plot(range(2,9), inertias, 'o-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow for wholesale data')
plt.savefig('wholesale_elbow.png')
plt.close()
# Choose k=4 (by elbow) and segment
kmeans_final = KMeans(n_clusters=4, random_state=42, n_init=10)
segments = kmeans_final.fit_predict(X_wholesale)
# Analyse segment means in original scale
df['Segment'] = segments
print(df.groupby('Segment').mean())
# Visualise segments in PCA space
plt.figure(figsize=(10,8))
plt.scatter(X_pca_wh[:,0], X_pca_wh[:,1], c=segments, cmap='tab10', alpha=0.7, s=60)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Customer segments in PCA space')
plt.colorbar()
plt.savefig('wholesale_segments.png')
plt.close()
Interpretation: Segment 0 might be "restaurants" (high fresh & frozen), Segment 1 "retail" (high grocery & detergents), etc. Such profiles inform marketing strategies.
6.2 MNIST Image Clustering
Clustering handwritten digits without labels – we use dimensionality reduction (UMAP/t‑SNE) followed by k‑means, then evaluate against true labels (for demonstration).
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
X_mnist = mnist.data[:5000] # subset for speed
y_mnist = mnist.target[:5000].astype(int)
# Preprocessing: scale to [0,1]
X_mnist = X_mnist / 255.0
# UMAP reduction (requires the optional umap-learn package)
import umap
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap_mnist = reducer.fit_transform(X_mnist)
# Cluster with k-means (k=10)
km_mnist = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters_mnist = km_mnist.fit_predict(X_umap_mnist)
# Visualise
plt.scatter(X_umap_mnist[:,0], X_umap_mnist[:,1], c=clusters_mnist, cmap='tab10', s=1, alpha=0.7)
plt.title('MNIST clusters after UMAP')
plt.savefig('mnist_clusters.png')
plt.close()
# Evaluate ARI
print('ARI vs true labels:', adjusted_rand_score(y_mnist, clusters_mnist))
6.3 Credit Card Fraud Detection (Anomaly)
Use Isolation Forest on a subset of credit card data (fraud is rare).
# Dummy example; real dataset would be from Kaggle
# X_credit = ... (features), y_credit = labels (0=normal,1=fraud)
# Train Isolation Forest (contamination=0.01)
# Evaluate precision/recall on test set
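The real Kaggle dataset is not bundled here, so the sketch below uses synthetic imbalanced data to show the same workflow; all feature values and rates are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.RandomState(42)
# Synthetic stand-in: 2000 normal transactions, 20 "frauds" far from the bulk
X_normal = rng.randn(2000, 4)
X_fraud = rng.randn(20, 4) * 0.5 + 6.0
X = np.r_[X_normal, X_fraud]
y = np.r_[np.zeros(2000), np.ones(20)]  # 1 = fraud

iforest = IsolationForest(contamination=0.01, random_state=42).fit(X)
pred = (iforest.predict(X) == -1).astype(int)  # -1 means anomaly

print("precision:", round(precision_score(y, pred), 3))
print("recall   :", round(recall_score(y, pred), 3))
```

With real fraud data the anomalies are far less separable than this toy setup, so precision/recall trade-offs (and the contamination setting) need careful tuning on a labelled validation slice.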
7. Challenges and Advanced Directions
Unsupervised learning is subtle: feature scaling, distance choice, hyperparameters, and initialisation all affect results. Validation must be grounded in domain knowledge. Modern directions include:
- Self‑supervised learning (contrastive methods like SimCLR, BYOL) that learn representations from unlabelled data by solving pretext tasks (e.g., predicting rotation, contrasting augmented views).
- Deep clustering (e.g., DeepCluster, SwAV) that jointly learns features and cluster assignments.
- Graph‑based clustering (spectral clustering, community detection) for network data.
- Variational autoencoders (VAEs) and generative adversarial networks (GANs) for learning latent distributions and generating new samples.
- Dimensionality reduction for streaming data (incremental PCA, t‑SNE with landmarks).
7.1 High‑Dimensional Data
In high dimensions, distances become less meaningful (curse of dimensionality). Use dimensionality reduction first, or employ cosine similarity, or subspace clustering methods (e.g., CLIQUE, SUBCLU).
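One practical consequence of the paragraph above: L2-normalising rows makes Euclidean k-means act like cosine-similarity clustering, since for unit vectors ||u − v||² = 2(1 − cos(u, v)). A sketch with synthetic direction-clustered data:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# High-dimensional points whose *direction*, not magnitude, carries the signal
A = rng.randn(100, 50) + 3 * rng.randn(1, 50)   # cluster around direction a
B = rng.randn(100, 50) + 3 * rng.randn(1, 50)   # cluster around direction b
X = np.vstack([A * rng.uniform(1, 10, (100, 1)),  # random magnitudes
               B * rng.uniform(1, 10, (100, 1))])

X_unit = normalize(X)  # project every row onto the unit sphere
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_unit)
purity = max(labels[:100].mean(), 1 - labels[:100].mean())
print("first-group purity:", round(float(purity), 3))
```

Without the normalisation step, the random magnitudes dominate Euclidean distance and the directional clusters are lost.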
7.2 Non‑Vectorial Data (Graphs, Sequences)
For graph data, community detection (Louvain, Leiden) and node embeddings (node2vec, GraphSAGE) are common. For time series, cluster using dynamic time warping (DTW) distances or extracted features.
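A minimal DTW distance in NumPy illustrates the idea (quadratic time; in practice a library such as tslearn or dtaidistance would be used):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sine and a time-shifted sine align well under DTW; noise does not
t = np.linspace(0, 2 * np.pi, 50)
a, b = np.sin(t), np.sin(t + 0.5)
noise = np.random.RandomState(0).randn(50)
print("DTW(sin, shifted sin):", round(dtw_distance(a, b), 3))
print("DTW(sin, noise)      :", round(dtw_distance(a, noise), 3))
```

Feeding a precomputed DTW distance matrix into hierarchical clustering (or k-medoids) then clusters series by shape rather than by pointwise alignment.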
7.3 Deep Clustering
Deep embedded clustering (DEC) learns feature representations and cluster assignments simultaneously. The loss function includes a clustering loss (KL divergence) plus reconstruction.
# Pseudo‑code for DEC:
# 1. Pre‑train autoencoder.
# 2. Initialise cluster centres with k‑means on latent space.
# 3. Fine‑tune with KL divergence between soft assignments and auxiliary target distribution.
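Step 3's soft assignments and auxiliary target distribution can be sketched in NumPy (Student-t kernel with alpha = 1, as in the DEC paper; the toy latent points and centres are illustrative):

```python
import numpy as np

def soft_assign(Z, centres, alpha=1.0):
    """Student-t soft assignment q_ij of latent points to cluster centres."""
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target p_ij = (q_ij^2 / f_j) / normaliser."""
    f = q.sum(axis=0)            # soft cluster frequencies
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.RandomState(42)
Z = rng.randn(6, 2)              # toy latent points
centres = np.array([[0.0, 0.0], [3.0, 3.0]])
q = soft_assign(Z, centres)
p = target_distribution(q)
print(np.round(q, 3))
print(np.round(p, 3))   # p is generally sharper (more confident) than q
```

Fine-tuning then minimises KL(p || q), pulling the encoder's soft assignments towards this sharpened target.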
8. Production & Scalability Considerations
Deploying unsupervised models requires attention to:
- Scalability: K‑means and PCA scale roughly linearly (via mini‑batch K‑means, incremental PCA). DBSCAN is O(n²) in the worst case but can be accelerated with spatial indexes (k‑d trees, ball trees). t‑SNE is slow for >100k points; UMAP or parametric t‑SNE are alternatives.
- Model persistence: Save trained models (centroids, PCA components, autoencoder weights) for inference.
- Online learning: For streaming data, use mini‑batch K‑means, online GMM, or autoencoders with incremental training.
- Interpretability: SHAP or feature contributions can explain cluster assignments.
- Monitoring: Track cluster sizes, reconstruction error, or anomaly scores over time to detect drift.
# Example: mini‑batch K‑means for large datasets
from sklearn.cluster import MiniBatchKMeans
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42)
mbk_labels = mbk.fit_predict(X) # X could be millions of rows
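Model persistence from the list above is typically a one-liner with joblib; a minimal round-trip sketch (file path illustrative):

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.RandomState(42).randn(500, 3)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42, n_init=3).fit(X)

# Persist the fitted model (centroids + metadata) and load it back
path = os.path.join(tempfile.gettempdir(), 'mbk_model.joblib')
joblib.dump(mbk, path)
mbk_loaded = joblib.load(path)

# The reloaded model must reproduce the original assignments exactly
assert np.array_equal(mbk.predict(X), mbk_loaded.predict(X))
print("model round-trips correctly")
```

In production the saved artefact should be versioned alongside the preprocessing objects (scaler, PCA), since predictions are only valid with the exact transformation the model was trained on.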
9. Fairness, Ethics & Interpretability
Unsupervised models can inadvertently encode biases present in data. For example, customer segmentation may lead to discriminatory pricing if clusters correlate with protected attributes. It's crucial to audit clusters for fairness and ensure interpretability. Tools like SHAP can explain why a point is assigned to a particular cluster.
10. Conclusion & Further Reading
We have explored the core pillars of unsupervised learning: clustering (k‑means, DBSCAN, hierarchical, GMM, spectral, etc.), dimensionality reduction (PCA, t‑SNE, UMAP, autoencoders), and anomaly detection (Isolation Forest, One‑Class SVM, LOF). These tools let us extract value from unlabelled data, detect fraud, segment customers, compress features, and drive scientific discovery. In production, pipelines must be carefully validated and monitored — but unsupervised learning remains an indispensable part of the machine learning practitioner’s arsenal.
Recommended resources:
- "Pattern Recognition and Machine Learning" by Christopher Bishop (chapters on mixture models, EM).
- "The Elements of Statistical Learning" by Hastie et al. (PCA, clustering).
- scikit‑learn documentation: Unsupervised learning.
- UMAP documentation: https://umap-learn.readthedocs.io/.
- Deep learning book by Goodfellow et al. (autoencoders, generative models).
📌 Complete runnable code
All code blocks are valid Python and can be executed sequentially in a Jupyter notebook or script. Required libraries: numpy, matplotlib, scikit-learn, scipy, pandas; optional: tensorflow (autoencoder sections) and umap-learn (UMAP sections). Install via pip install numpy matplotlib scikit-learn scipy pandas tensorflow umap-learn. Each segment produces diagnostic plots and prints; random seeds are fixed for reproducibility.