Bivariate Analysis (Categorical vs. Numeric)
Compare numerical values across different categories.
Introduction: Comparative Shopping
Categorical vs. Numerical Analysis (Cat-Num EDA) is like comparing the average price of an item across different stores. You are comparing a **Numerical** value (Price) across different **Categories** (Store A, Store B).
The key goal is to find out whether the numerical variable differs meaningfully between the groups. For example: Does **Gender** influence the **Fare** paid?
Topic 1: The Average: sns.barplot()
A **Barplot** is the simplest way to see the difference in the average (mean) of the numerical column across categories. It's the "fast check" for comparison.
Bar Plot Features:
- **Height:** Represents the mean value of the numerical column (Y-axis).
- **X-Axis:** The categorical groups.
- **Error Bars:** The small black lines show, by default, a bootstrapped 95% confidence interval for the mean (not the standard error), indicating how uncertain the estimated mean is.
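To make the bar height and the error bar concrete, here is a minimal sketch that computes both by hand for a single group. It assumes the error bar is a bootstrapped 95% confidence interval of the mean; the fare values, the seed, and the helper name `bootstrap_ci_of_mean` are invented for illustration and are not part of seaborn.

```python
import numpy as np

def bootstrap_ci_of_mean(values, n_boot=1000, level=95, seed=0):
    """Return (mean, low, high); low/high bound a bootstrap CI of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides of the bootstrap distribution.
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

# Hypothetical fares for one passenger class.
fares = [7.25, 8.05, 7.90, 13.00, 26.00, 10.50, 7.75, 21.08]
mean, low, high = bootstrap_ci_of_mean(fares)
print(f"bar height (mean): {mean:.2f}")
print(f"error bar (95% CI): {low:.2f} to {high:.2f}")
```

A wide interval usually means a small or highly variable group, so the bar height should be read with more caution.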
Example: Average Fare by Passenger Class
sns.barplot(x='Pclass', y='Fare', data=df)
plt.title("Average Fare by Passenger Class")
plt.show()
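The snippet above assumes that `sns`, `plt`, and `df` already exist. A self-contained sketch might look like the following. The tiny DataFrame is invented stand-in data with Titanic-style column names, not the real dataset, and the `Agg` backend and output file name are assumptions chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the Titanic data (same column names as in the text).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare":   [71.3, 53.1, 51.9, 13.0, 26.0, 10.5, 7.25, 8.05, 7.9],
})

# One bar per class; bar height is the mean Fare of that class.
ax = sns.barplot(x="Pclass", y="Fare", data=df)
ax.set_title("Average Fare by Passenger Class")
plt.savefig("fare_by_class.png")  # hypothetical output file name
```

In an interactive session, `plt.show()` can replace the `savefig` call.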
Topic 2: The Full Picture: sns.boxplot()
A **Grouped Boxplot** is more informative than a barplot because it shows the median, quartiles, range, and outliers for every category instead of a single average. This is crucial for comparing the spread and finding group-specific outliers.
Boxplot Revelation:
It helps answer: "Does **Pclass 1** have the same variation in **Age** as **Pclass 3**?" It compares medians, not means, making it robust against outliers.
Example: Age Distribution by Class
sns.boxplot(x='Pclass', y='Age', data=df)
plt.title("Age Distribution by Passenger Class")
plt.show()
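The numbers a boxplot draws can also be computed directly. The sketch below uses invented ages for one class and assumes the common 1.5 × IQR whisker rule to flag outliers. It also shows why the median is steadier than the mean when an outlier is present.

```python
import pandas as pd

# Invented ages for one passenger class; 80 is a deliberate outlier.
ages = pd.Series([22, 24, 25, 26, 28, 30, 31, 80])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers.tolist())
print(f"mean={ages.mean():.2f} vs median={median}")
```

Here the single value of 80 pulls the mean up to 33.25, while the median stays at 27.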
Topic 3: Advanced View: sns.violinplot()
A **Violin Plot** combines a boxplot with a **Kernel Density Estimate (KDE)** plot. This shows the actual *shape* of the data distribution (density) for each category, which a boxplot cannot do.
If you have bimodal data (two peaks), a boxplot won't show it, but a violin plot clearly visualizes the two density peaks.
Example: Fare Density by Embarkation Port
sns.violinplot(x='Embarked', y='Fare', data=df)
plt.title("Fare Distribution Density")
plt.show()
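A small numeric sketch of the bimodal case: the synthetic fares below (invented data, two clusters around 10 and 50) have a median that falls in a range containing no observations at all. That is the kind of structure a boxplot hides and a density view reveals. Coarse histogram counts stand in here for the KDE.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic bimodal sample: 500 cheap fares near 10, 500 expensive fares near 50.
fares = np.concatenate([rng.normal(10, 1, 500), rng.normal(50, 1, 500)])

median = np.median(fares)
# Three coarse bins: low (0-20), middle (20-40), high (40-60).
counts, edges = np.histogram(fares, bins=[0, 20, 40, 60])

print(f"median = {median:.1f}")            # lands between the two clusters
print("counts per bin:", counts.tolist())  # the middle bin is empty
```

A boxplot of this sample would draw its median line in the empty middle range, whereas a violin plot would show two separate bulges.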
Topic 4: Adding Hue (Multivariate Analysis)
You can upgrade any Bivariate plot (like Barplot or Boxplot) to a Multivariate plot by adding a **third categorical variable** using the **hue** parameter.
**Question:** *Does Pclass affect Fare differently for Men and Women?* The hue parameter answers this by creating separate bars/boxes for each group.
Example: Fare by Pclass and Sex
sns.barplot(x='Pclass', y='Fare', hue='Sex', data=df)
plt.title("Average Fare by Pclass & Sex")
plt.show()
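The bars that `hue` produces correspond to one mean per combination of the two categories. As a rough numeric counterpart, the sketch below groups by both columns in pandas. The rows are invented and only reuse the Titanic-style column names.

```python
import pandas as pd

# Invented rows with Titanic-style column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":    ["male", "female", "male", "female"] * 2,
    "Fare":   [60.0, 100.0, 70.0, 110.0, 8.0, 12.0, 10.0, 14.0],
})

# One mean per (Pclass, Sex) pair; unstack turns Sex into columns.
mean_fare = df.groupby(["Pclass", "Sex"])["Fare"].mean().unstack("Sex")
print(mean_fare)
```

Each row of the resulting table is one X-axis group, and each column is one hue color.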
Topic 5: Exact Numbers: .groupby()
Plots are great, but the **exact figures** come from **Pandas**. You use .groupby() to break down the calculation by category.
Example: Full Statistical Summary
age_summary_by_class = df.groupby('Pclass')['Age'].describe()
print(age_summary_by_class)
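When only a few statistics are needed, `.agg()` with named aggregations is one alternative to `.describe()`. This is a sketch on invented data; the output column names (`n_known`, `mean_age`, `median_age`) are arbitrary choices. Note that these aggregations skip missing values, so the count can be lower than the number of rows.

```python
import numpy as np
import pandas as pd

# Invented data; one age in class 1 is missing on purpose.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [38.0, 54.0, np.nan, 22.0, 26.0, 30.0],
})

summary = df.groupby("Pclass")["Age"].agg(
    n_known="count",      # non-missing ages only
    mean_age="mean",
    median_age="median",
)
print(summary)
```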
Module Summary
- Barplot: Compares the **Mean** (Average) across categories.
- Boxplot: Compares the **Distribution** (Median, Spread) across categories. Best for seeing outliers.
- Violin Plot: Shows the **Density** (Shape of distribution).
- Hue: Adds a third category (Multivariate analysis).
- GroupBy: The Pandas method to get exact numerical summaries.
Interview Q&A
A Barplot can be misleading when the data has outliers. It shows only the mean (plus an error bar), and the mean is easily skewed by outliers. A Boxplot shows the median and the spread, giving a truer picture of the distribution.
By default, sns.barplot() plots the **Mean (Average)** of the numerical variable (Y-axis) for each category (X-axis). You can change this using the estimator parameter.
You add a third categorical variable using the **hue** parameter (e.g., sns.boxplot(x='City', y='Sales', hue='Quarter', data=df)). This creates side-by-side boxes for every combination.
.groupby().describe() provides the full statistical snapshot (count, mean, std, min, max, and all quartiles) for the grouped data. .mean() only provides one number (the average), which is insufficient for checking data spread and quality.