Unsupervised Learning: Clustering
Learn how to group unlabeled data using K-Means, DBSCAN, and Hierarchical methods.
💡 Topic 1: Clustering (Unsupervised Learning)
Clustering is a core technique in Unsupervised Learning. This means the algorithm analyzes data without any predefined labels or correct answers (unlike regression or classification).
The algorithm's job is to automatically find hidden structures or groups in the data. It clusters data points that are "similar" to each other based on their features (e.g., proximity in a multi-dimensional space).
Real-World Analogy: Music Recommendations
A music streaming app groups users by similar listening habits (e.g., users who listen to fast, guitar-heavy songs vs. slow, orchestral songs). These groups are the clusters used for targeted recommendations.
Key Use Cases:
- Customer Segmentation: Grouping customers by purchasing habits to target marketing campaigns.
- Anomaly Detection: Identifying data points that don't fit into any group (outliers) as suspicious activity.
- Data Reduction: Using cluster centroids to summarize large datasets.
🎯 Topic 2: K-Means Clustering (Centroid-Based)
K-Means is the most popular clustering algorithm. It aims to partition data into a predefined number, K, of clusters.
The algorithm works by iteratively finding the best position for cluster centers (centroids) through two steps: Assign (each point goes to the nearest center) and Update (the center moves to the average location of its assigned points). The objective is to minimize the total Inertia (Within-Cluster Sum of Squares).
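To make the two steps concrete, here is a minimal NumPy sketch of a single Assign/Update iteration. The helper `kmeans_step` is hypothetical (not part of scikit-learn) and assumes every cluster keeps at least one assigned point; in practice you would use scikit-learn's KMeans, shown below.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Assign + Update iteration of K-Means (illustrative sketch)."""
    # Assign: each point goes to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update: each centroid moves to the mean of its assigned points
    # (assumes no cluster ends up empty)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids
```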
⚠️ Critical Requirement: Data Scaling
K-Means measures Euclidean distance. If features have vastly different scales (e.g., 'Age' vs 'Salary'), the larger feature dominates the distance calculation. You must scale your data (e.g., with StandardScaler) before using K-Means.
💻 Example: K-Means Implementation
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale features so no single feature dominates the distance calculation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means with 3 clusters
model = KMeans(n_clusters=3, random_state=42, n_init='auto')
model.fit(X_scaled)
labels = model.labels_
```
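After fitting, the estimator exposes the learned centroids and the final inertia (the same WCSS used by the Elbow Method below). In this sketch, `X_new` is a hypothetical batch of unseen data:

```python
print(model.cluster_centers_)   # centroid coordinates (in scaled space)
print(model.inertia_)           # within-cluster sum of squares (WCSS)

# Assign cluster labels to new data (X_new is hypothetical unseen data);
# remember to apply the same fitted scaler first
new_labels = model.predict(scaler.transform(X_new))
```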
🦴 Topic 3: The Elbow Method (Choosing K)
The major weakness of K-Means is that you have to choose K beforehand. The Elbow Method is a heuristic technique to visually determine the optimal number of clusters.
We measure Inertia (or WCSS - Within-Cluster Sum of Squares), which is the sum of squared distances from every point to its closest centroid. As K increases, Inertia always decreases.
We plot K vs. Inertia. The ideal K is the point where the rate of decrease in inertia slows down significantly, creating a bend or "elbow" in the graph; beyond this point, adding more clusters yields diminishing returns.
[Image: Elbow Method graph of K vs. Inertia]
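A minimal sketch of the Elbow Method, assuming `X_scaled` from the K-Means example above and matplotlib for plotting:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X_scaled)
    inertias.append(km.inertia_)  # WCSS for this value of K

plt.plot(k_values, inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()
```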
🌳 Topic 4: Hierarchical Clustering (Dendrograms)
This method does not require a predefined K. Instead, it builds a tree structure of clusters called a Dendrogram, showing the merging or splitting process of every data point.
Agglomerative (Bottom-Up):
The most common type starts with every data point as its own cluster (N clusters) and then progressively merges the two closest clusters (based on a linkage criterion) until only one giant cluster remains. The final cluster number (K) is chosen by cutting the dendrogram at a specific height.
💻 Example: Implementing Agglomerative Clustering
```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that minimizes
# the increase in within-cluster variance
agg_model = AgglomerativeClustering(n_clusters=5, linkage='ward')
labels = agg_model.fit_predict(X_scaled)
```
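scikit-learn does not plot dendrograms directly; a common approach, sketched here with SciPy's hierarchy utilities on the same scaled data, builds the merge tree and draws it so you can choose where to cut:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Compute the full merge tree with Ward linkage, then draw the dendrogram
Z = linkage(X_scaled, method='ward')
dendrogram(Z)
plt.xlabel('Data points')
plt.ylabel('Merge distance')
plt.show()
```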
🌌 Topic 5: DBSCAN (Density-Based)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is powerful because it finds clusters of arbitrary shapes (moons, spirals), unlike K-Means, which assumes roughly spherical clusters. It groups points based on density.
Key Parameters:
- eps: The radius of the neighborhood to check for density around a point.
- min_samples: The minimum number of points required within the eps radius to form a dense region (a cluster core).
A major advantage is that DBSCAN automatically flags points that don't belong to any dense region as Noise (outliers), labeled as -1.
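A minimal DBSCAN sketch; the parameter values here are illustrative (eps in particular depends on your data's scale, so tune it after scaling):

```python
from sklearn.cluster import DBSCAN

# eps and min_samples are illustrative starting values, not universal defaults
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

# Points labeled -1 are noise (outliers outside any dense region)
n_noise = (labels == -1).sum()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```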
📚 Module Summary
- K-Means: Fast, scalable. Requires predefined K. Assumes spherical clusters. Requires scaling.
- Hierarchical: Creates a Dendrogram. No predefined K, but final K is chosen by cutting the tree. Slow for large data.
- DBSCAN: Finds arbitrary shapes. Automatically flags outliers (-1). Sensitive to density variations.