Unsupervised Learning: Clustering

           

Learn how to group unlabeled data using K-Means, DBSCAN, and Hierarchical methods.

           
💡 Topic 1: Clustering (Unsupervised Learning)

Clustering is a core technique in Unsupervised Learning: the algorithm analyzes data without any predefined labels or correct answers (unlike supervised methods such as regression or classification).

The algorithm's job is to automatically find hidden structures or groups in the data. It clusters data points that are "similar" to each other based on their features (e.g., proximity in a multi-dimensional space).

Real-World Analogy: Music Recommendations

A music streaming app groups users by similar listening habits (e.g., users who favor fast, guitar-heavy songs vs. slow, orchestral pieces). These groups are the clusters used for targeted recommendations.

Key Use Cases:

  • Customer Segmentation: Grouping customers by purchasing habits to target marketing campaigns.
  • Anomaly Detection: Identifying data points that don't fit into any group (outliers) as suspicious activity.
  • Data Reduction: Using cluster centroids to summarize large datasets.

🎯 Topic 2: K-Means Clustering (Centroid-Based)

K-Means is the most popular clustering algorithm. It aims to partition data into a predefined number, K, of clusters.

The algorithm works by iteratively finding the best position for cluster centers (centroids) through two steps: Assign (each point goes to the nearest center) and Update (the center moves to the average location of its assigned points). The objective is to minimize the total Inertia (Within-Cluster Sum of Squares).
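
💻 Example: The Assign/Update Loop (NumPy sketch)

To make the two steps concrete, here is a minimal NumPy sketch of one iteration. The helper name kmeans_step is hypothetical; it assumes a 2-D array X of points and that no cluster ends up empty.

import numpy as np

def kmeans_step(X, centroids):
    # Assign: each point joins the cluster of its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each centroid moves to the average location of its
    # assigned points (simplification: assumes no cluster is empty)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

Repeating these two steps until the centroids stop moving is, in essence, the whole algorithm.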

⚠️ Critical Requirement: Data Scaling

K-Means measures Euclidean distance. If features have vastly different scales (e.g., 'Age' vs 'Salary'), the larger feature dominates the distance calculation. You must scale your data (e.g., with StandardScaler) before using K-Means.

💻 Example: K-Means Implementation

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy data: 300 points around 3 centers (stand-in for your own X)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale features so no single one dominates the distance metric
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means with K=3; n_init='auto' chooses the number of restarts
model = KMeans(n_clusters=3, random_state=42, n_init='auto')
model.fit(X_scaled)
labels = model.labels_  # cluster index (0, 1, or 2) for each point

🦴 Topic 3: The Elbow Method (Choosing K)

The major weakness of K-Means is that you have to choose K beforehand. The Elbow Method is a heuristic technique to visually determine the optimal number of clusters.

We measure Inertia (or WCSS - Within-Cluster Sum of Squares), which is the sum of squared distances from every point to its closest centroid. As K increases, Inertia always decreases.

We plot K vs. Inertia. The ideal K is the point where the rate of decrease in inertia slows down significantly, creating a bend or "elbow" in the graph: the point of diminishing returns.

[Image: Elbow Method graph]
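
💻 Example: Plotting the Elbow Curve

A short sketch of the method, reusing X_scaled from the K-Means example above; the K range of 1-10 is an arbitrary choice.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X_scaled)
    inertias.append(km.inertia_)  # WCSS for this value of K

plt.plot(k_values, inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()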

🌳 Topic 4: Hierarchical Clustering (Dendrograms)

This method does not require a predefined K. Instead, it builds a tree structure of clusters called a Dendrogram, showing the merging or splitting process of every data point.

Agglomerative (Bottom-Up):

The most common type starts with every data point as its own cluster (N clusters) and then progressively merges the two closest clusters (based on a linkage criterion) until only one giant cluster remains. The final cluster number (K) is chosen by cutting the dendrogram at a specific height.

💻 Example: Implementing Agglomerative Clustering

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that least increases
# total within-cluster variance; reuses X_scaled from Topic 2
agg_model = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg_model.fit_predict(X_scaled)
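
💻 Example: Plotting a Dendrogram (SciPy sketch)

scikit-learn does not draw dendrograms itself; a common approach, sketched here, is to build the merge tree with SciPy's linkage function and cut it at a chosen height with fcluster (the height t=10 below is an arbitrary assumption).

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Build the merge tree with Ward linkage on the same scaled data
Z = linkage(X_scaled, method='ward')

dendrogram(Z)
plt.title('Dendrogram (Ward linkage)')
plt.ylabel('Merge distance')
plt.show()

# "Cut" the tree at height 10 to obtain flat cluster labels
cut_labels = fcluster(Z, t=10, criterion='distance')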

🌌 Topic 5: DBSCAN (Density-Based)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is powerful because it finds clusters of arbitrary shapes (moons, spirals), unlike K-Means which is limited to spherical clusters. It groups points based on density.

Key Parameters:

  • eps: The radius of the neighborhood to check for density around a point.
  • min_samples: The minimum number of points required within the eps radius to form a dense region (a cluster core).

A major advantage is that DBSCAN automatically flags points that don't belong to any dense region as Noise (outliers), labeled as -1.
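
💻 Example: DBSCAN Implementation

A minimal sketch using scikit-learn's DBSCAN on the scaled data from earlier; eps=0.5 and min_samples=5 are the library defaults and usually need tuning for real data.

import numpy as np
from sklearn.cluster import DBSCAN

# eps sets the neighborhood radius; min_samples the density threshold
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

# Points labeled -1 did not fit in any dense region (noise/outliers)
n_noise = np.sum(labels == -1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f'Clusters found: {n_clusters}, noise points: {n_noise}')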

📚 Module Summary

  • K-Means: Fast, scalable. Requires predefined K. Assumes spherical clusters. Requires scaling.
  • Hierarchical: Creates a Dendrogram. No predefined K, but final K is chosen by cutting the tree. Slow for large data.
  • DBSCAN: Finds arbitrary shapes. Automatically flags outliers (-1). Sensitive to density variations.

🤔 Interview Q&A

Q: What is the major weakness of K-Means?

A: It requires the number of clusters (K) to be specified beforehand, and it performs poorly on clusters that are non-spherical or have varying densities.

Q: Why must you scale your data before using K-Means?

A: K-Means measures Euclidean distance. If features are on different scales (e.g., Age 20-60 vs. Income 50k-1M), the larger feature will disproportionately influence the distance calculation, leading to inaccurate clusters.

Q: How does DBSCAN handle outliers?

A: DBSCAN automatically labels outliers with -1. These points are considered "noise" because they do not belong to any dense region, which makes DBSCAN well suited to outlier detection.

Q: How do you use the Elbow Method to choose K?

A: Plot K vs. Inertia (WCSS). The optimal K is the point where the line stops bending sharply and starts to flatten out, representing the point of diminishing returns.

       
               
                                   