Decision Trees in Machine Learning
Learn to build intuitive classification models, visualize them, and prevent overfitting.
Topic 1: What is a Decision Tree? 🌳
A Decision Tree is one of the most intuitive and powerful models in machine learning. It's a supervised learning algorithm used for both classification (e.g., Is this a Cat, Dog, or Bird?) and regression (e.g., What will the price be?).
It works by asking a series of simple "yes/no" questions to split the data, just like a flowchart or a game of "20 Questions."
Real-World Analogy: Classifying a Fruit
Imagine building a model to identify a fruit. The tree would learn a set of rules structured by nodes (written out as code just after this list):
- Root Node: "Is the color Red?"
  - Yes: "Is the diameter < 2 inches?"
    - Yes: It's a Cherry (Leaf Node)
    - No: It's an Apple (Leaf Node)
  - No: "Is the color Yellow?"
    - Yes: It's a Banana (Leaf Node)
    - No: It's a Grape (Leaf Node)
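Written as code, the same flowchart is just a chain of if/else checks. The sketch below is hand-written to mirror the analogy above; nothing here is learned from data, and the function name and thresholds are only illustrative.

def classify_fruit(color: str, diameter_inches: float) -> str:
    """Hand-written rules mirroring the example tree above (illustrative only)."""
    if color == "red":               # Root Node
        if diameter_inches < 2:      # Decision Node
            return "cherry"          # Leaf Node
        return "apple"               # Leaf Node
    if color == "yellow":            # Decision Node
        return "banana"              # Leaf Node
    return "grape"                   # Leaf Node

print(classify_fruit("red", 0.8))    # -> cherry
print(classify_fruit("yellow", 7.0)) # -> banana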
The Decision Tree is a "white box" model, meaning we can see and understand every single rule it learned. This interpretability is its biggest advantage.
Key Parts of a Tree
- Root Node: The very first question that splits all the data and provides the maximum information gain (e.g., "Is the color Red?").
- Decision Node: Any node that asks a question and splits the data further into sub-nodes (branches).
- Leaf Node (Terminal Node): The final answer. A node that makes a final prediction and has no further branches.
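To make these parts concrete, here is a minimal sketch (assuming scikit-learn and its built-in Iris dataset, which are not part of the fruit example) that fits a shallow tree and prints every rule it learned, from the root node down to the leaf nodes:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned rules: root node, decision nodes, and leaf nodes
print(export_text(tree, feature_names=iris.feature_names))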
Topic 2: How Does a Tree "Learn"? (Splitting Criterion)
How does the tree know which question to ask first? It doesn't guess; it uses mathematics to find the question that creates the purest possible split.
The goal is to find a split that results in new child groups that are as "un-mixed" as possible (i.e., contain mostly one class).
Analogy: Sorting Marbles
Imagine a starting node (jar) with 50 red and 50 blue marbles (maximum "impurity"). A perfect split would create two new jars: one with 50 red and zero blue, and the other with zero red and 50 blue. The new jars are completely "pure."
We measure this "purity" using one of two main mathematical criteria, and we choose the split that maximizes the Information Gain:
1. Gini Impurity (Default for Classification)
Gini Impurity is the probability of incorrectly labeling a randomly chosen data point if you labeled it according to the class distribution in the node.
- A Pure Node (100% Class A) has a Gini Impurity of 0.0.
- An Impure Node (50% Class A, 50% Class B) has the maximum Gini Impurity of 0.5.
The tree chooses the split point (a feature and a threshold) that yields the biggest decrease in impurity across the resulting child nodes. This decrease is the Information Gain.
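Here is a minimal sketch of that arithmetic in plain Python. The class counts are the marble jars from the analogy, not real data; gini() reproduces the 0.0 and 0.5 values above, and the last lines show the Information Gain of the perfect split.

def gini(counts):
    """Gini impurity of a node, given the class counts inside it."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([100, 0]))  # pure node          -> 0.0
print(gini([50, 50]))  # 50/50 "marble jar" -> 0.5

# Information Gain of the perfect marble split: parent 50/50 -> two pure children
parent = gini([50, 50])
children = 0.5 * gini([50, 0]) + 0.5 * gini([0, 50])
print(parent - children)  # -> 0.5, the largest possible decrease here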
2. Entropy (Measure of Disorder)
Entropy is a concept from information theory used to measure the "disorder" or "surprise" within a node.
- A Pure Node (100% Setosa) has 0.0 Entropy (no disorder, no surprise).
- An Impure Node (50% Setosa, 50% Versicolor) has 1.0 Entropy (maximum disorder).
The calculation is different, but the goal is the same: select the split that maximizes the drop in Entropy (i.e., maximizes Information Gain). By default, sklearn uses criterion='gini', but you can set criterion='entropy'.
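For comparison, a short sketch of the entropy calculation (using a base-2 logarithm, so the values are in bits), plus the one-argument change that switches sklearn from Gini to Entropy:

import math
from sklearn.tree import DecisionTreeClassifier

def entropy(counts):
    """Entropy (in bits) of a node, given the class counts inside it."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

print(entropy([100, 0]))  # pure node  -> 0.0
print(entropy([50, 50]))  # 50/50 node -> 1.0

# sklearn defaults to criterion='gini'; entropy is one argument away
model = DecisionTreeClassifier(criterion="entropy")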
Topic 3: The Overfitting Problem (And Pruning) 🛑
The single biggest weakness of Decision Trees is their tendency to overfit. Left unchecked, the tree keeps splitting until every leaf is pure, no matter how deep or how specific the rules become.
Overfitting means the model "memorizes" the training data, including its noise and outliers, instead of "learning" the general, underlying patterns. This results in 100% accuracy on the training set but miserable performance on new, unseen data.
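A quick way to see that gap is to compare training and test accuracy for an unrestricted tree. The breast-cancer dataset and the split below are placeholder assumptions for illustration, not results from any specific experiment:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No limits: the tree keeps splitting until it has memorized the training set
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", deep_tree.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", deep_tree.score(X_test, y_test))    # usually noticeably lower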
The Solution: Pruning (Stopping the Growth)
We "prune" the tree by setting limits on its growth before we train it. These hyperparameters stop the tree from asking trivial or overly specific questions:
Key Pruning Parameters (Hyperparameters)
- max_depth (e.g., max_depth=3): This is the most common method. It sets the maximum number of decision levels, forcing the model to capture only the most important, general rules.
- min_samples_leaf (e.g., min_samples_leaf=5): This tells the tree: "Don't make a split if any of the resulting leaf nodes would have fewer than 5 samples." It prevents the tree from creating tiny, overly specific nodes for individual data points.
- min_samples_split (e.g., min_samples_split=20): This tells the tree: "Don't even try to split a node unless it has at least 20 samples in it." It prevents splitting nodes that are already quite small.
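Put together, pruning is just a matter of passing these hyperparameters to the constructor. The specific values below are illustrative assumptions, not recommendations:

from sklearn.tree import DecisionTreeClassifier

pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # at most 3 levels of questions
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    min_samples_split=20,  # only consider splitting nodes with 20+ samples
    random_state=42,
)
# pruned_tree.fit(X_train, y_train)  # fit once training data is available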
Topic 4: Feature Importance and Interpretability
One of the best features of a Decision Tree is that it's a "white-box" model: we can easily inspect exactly how it made its decisions. This property is known as interpretability.
The tree automatically calculates which features were the most "useful" for reducing impurity. A feature that is used high up in the tree (at the root or first few decision nodes) provides the most Information Gain and is therefore considered the most important.
We can access this importance score easily after training in sklearn (the dataset and setup in the snippet below are just placeholders for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 0. Assumed setup: the built-in Iris data stands in for your own training set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# 1. Train the model
model.fit(X_train, y_train)

# 2. Get the importance scores (they sum to 1.0)
importances = model.feature_importances_
# e.g., [0.01, 0.0, 0.95, 0.04] -> Feature 2 is 95% important

# 3. Visualize the results to see which features mattered most
# sns.barplot(x=importances, y=load_iris().feature_names)
Advantages and Disadvantages Summary
| Advantage (+) | Disadvantage (-) |
|---|---|
| High Interpretability: "White Box" model (Easy to explain). | Overfitting: Prone to memorizing data if not pruned. |
| No Scaling Required: Works well with numerical and categorical data without normalization. | Instability: Small changes in the training data can result in a completely different tree structure. |
| Feature Importance: Automatically ranks features by usefulness. | Bias towards Dominant Features: Tends to prefer features with more levels or distinct values. |
Topic 5: Frequently Asked Questions (FAQ)
Q: What is the difference between Gini Impurity and Entropy?
Both measure impurity, but Entropy is based on logarithms (information theory) while Gini is simpler (based on squared probabilities). Gini is slightly faster to compute, which is why it's often the default. In practice, they rarely lead to different best split points.
Q: Are Decision Trees sensitive to outliers?
Unlike linear models, Decision Trees are generally robust to outliers in the feature values, because splits are based on ordering and thresholds rather than on distances or means. However, an outlier with a very unusual label can still cause the tree to overfit by creating a highly specific leaf node just for that point.
Q: What exactly is Information Gain?
Information Gain (IG) is the reduction in impurity achieved by splitting a node. It equals (Impurity of Parent Node) - (Weighted Average Impurity of Child Nodes). The algorithm always chooses the split that results in the maximum IG.