DSPython - Bivariate (Categorical) Analysis

Bivariate Analysis (Categorical vs. Numeric)

Compare numerical values across different categories.

Data Visualization · Beginner · 45 min

🛒 Introduction: Comparative Shopping

Categorical vs. Numerical Analysis (Cat-Num EDA) is like comparing the average price of an item across different stores. You are comparing a **Numerical** value (Price) across different **Categories** (Store A, Store B).

The key goal is to find out whether the numerical value differs meaningfully between groups. For example: Does **Gender** influence the **Fare** paid?

[Image of a boxplot chart showing quartiles and outlier points]

📊 Topic 1: The Average: `sns.barplot()`

A **Barplot** is the simplest way to see the difference in the average (mean) of the numerical column across categories. It's the "fast check" for comparison.

✅ Bar Plot Features:

**Height:** Represents the mean value of the numerical column (Y-axis).
**X-Axis:** The categorical groups.
**Error Bars:** The small black lines show a confidence interval around the mean (by default a bootstrapped 95% interval), indicating how uncertain the estimate of the mean is.
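To make the bar height and the error bar concrete, the sketch below computes both by hand with pandas and NumPy. The small `toy` DataFrame is invented for illustration (it is not the Titanic data), and `bootstrap_ci` is a hypothetical helper that approximates a percentile bootstrap interval; it is not claimed to match seaborn's internal computation exactly.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare":   [80.0, 90.0, 100.0, 20.0, 25.0, 30.0, 7.0, 8.0, 9.0],
})

# Bar height: the mean of the numeric column within each category.
bar_heights = toy.groupby("Pclass")["Fare"].mean()

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Hypothetical helper: percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Error bar: an interval around each group's mean.
error_bars = {cls: bootstrap_ci(grp) for cls, grp in toy.groupby("Pclass")["Fare"]}

print(bar_heights)
print(error_bars)
```

With only three values per class the intervals are wide; on a real dataset they shrink as each group gets more rows.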

💻 Example: Average Fare by Passenger Class

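import seaborn as sns
import matplotlib.pyplot as plt
# df is assumed to be the Titanic DataFrame, already loaded
sns.barplot(x='Pclass', y='Fare', data=df)
plt.title("Average Fare by Passenger Class")
plt.show()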

📦 Topic 2: The Full Picture: `sns.boxplot()`

A **Grouped Boxplot** tells you more than a barplot because it summarizes the *spread* of every category (median, quartiles, and whiskers) instead of a single average. This is crucial for comparing variation between groups and finding group-specific outliers.

💡 Boxplot Revelation:

It helps answer: "Does **Pclass 1** have the same variation in **Age** as **Pclass 3**?" It compares medians, not means, making it robust against outliers.
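The numbers behind each box can be computed directly, which helps when reading the plot. The following is a minimal sketch on invented toy data; `box_stats` is a hypothetical helper, and the 1.5 × IQR fence it uses is the common convention for boxplot whiskers (assumed here, since the lesson does not state it).

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset).
toy = pd.DataFrame({
    "Pclass": [1] * 6 + [3] * 6,
    "Age":    [30, 35, 40, 45, 50, 95, 18, 20, 22, 24, 26, 28],
})

def box_stats(ages):
    """Hypothetical helper: the numbers a boxplot draws for one group."""
    q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Common 1.5 * IQR rule: points beyond these fences are drawn as outliers.
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = ages[(ages < lower_fence) | (ages > upper_fence)].tolist()
    return {"median": median, "iqr": iqr, "outliers": outliers}

# One summary per category, mirroring one box per category in the plot.
summary = {cls: box_stats(grp) for cls, grp in toy.groupby("Pclass")["Age"]}

for cls, stats in summary.items():
    print(cls, stats)
```

In this toy data, class 1 has a wider box (larger IQR) and one outlier at age 95, while class 3 has a narrow box and no outliers.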

💻 Example: Age Distribution by Class

sns.boxplot(x='Pclass', y='Age', data=df)
plt.title("Age Distribution by Passenger Class")
plt.show()

🎻 Topic 3: Advanced View: `sns.violinplot()`

A **Violin Plot** combines a boxplot with a **Kernel Density Estimate (KDE)** plot. This shows the actual *shape* of the data distribution (density) for each category, which a boxplot cannot do.

If you have bimodal data (two peaks), a boxplot won't show it, but a violin plot clearly visualizes the two density peaks.
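A quick numerical check shows why the quartiles alone hide two peaks. This sketch uses a synthetic sample generated for illustration (two clusters around 20 and 80, an assumption chosen to make the effect obvious) and compares the boxplot quartiles with simple bin counts, a rough stand-in for the density a violin plot draws.

```python
import numpy as np

# Synthetic sample, generated for illustration only: two clusters of values.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(20, 5, 500), rng.normal(80, 5, 500)])

# What a boxplot draws: three quartiles, with no hint of the gap in between.
q1, median, q3 = np.percentile(values, [25, 50, 75])

# What a density view reveals: counts per bin, with an almost empty middle.
counts, edges = np.histogram(values, bins=[0, 20, 40, 60, 80, 100])

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")
print(dict(zip(edges[:-1].tolist(), counts.tolist())))
```

The median lands in the middle bin, where almost no observations sit. A boxplot would draw its center line there, while the bin counts (and a violin plot) show that the data actually lives in two separate clusters.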

💻 Example: Fare Density by Embarkation Port

sns.violinplot(x='Embarked', y='Fare', data=df)
plt.title("Fare Distribution Density")
plt.show()

🎨 Topic 4: Adding Hue (Multivariate Analysis)

You can upgrade any Bivariate plot (like Barplot or Boxplot) to a Multivariate plot by adding a **third categorical variable** using the **hue** parameter.

**Question:** *Does Pclass affect Fare differently for Men and Women?* The hue parameter answers this by creating separate bars/boxes for each group.
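The grouping that hue performs can be mirrored in pandas by grouping on two columns at once. The sketch below uses invented toy data (not the Titanic dataset) and assumes the column names `Pclass`, `Sex`, and `Fare` used elsewhere in this lesson.

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":    ["male", "male", "female", "female"] * 2,
    "Fare":   [60.0, 80.0, 100.0, 120.0, 8.0, 10.0, 12.0, 14.0],
})

# Group by both categories, then pivot Sex into columns:
# one row per x-axis group, one column per hue level.
mean_fare = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack("Sex")
print(mean_fare)

# Gap between the two hue levels within each class.
gap = mean_fare["female"] - mean_fare["male"]
print(gap)
```

If the gap differs from class to class, as it does in this toy data (40 in class 1, 4 in class 3), then the third variable changes how the first two relate, which is exactly what the side-by-side bars are meant to reveal.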

💻 Example: Fare by Pclass and Sex

sns.barplot(x='Pclass', y='Fare', hue='Sex', data=df)
plt.title("Average Fare by Pclass & Sex")
plt.show()

# Now you have bars grouped by Pclass, and colored by Sex.

🔢 Topic 5: Exact Numbers: `.groupby()`

Plots are great, but the **exact figures** come from **Pandas**. You use .groupby() to break down the calculation by category.

💻 Example: Full Statistical Summary

age_summary_by_class = df.groupby('Pclass')['Age'].describe()
print(age_summary_by_class)

# Output includes count, mean, std, min, max, and the quartiles (25%, 50%, 75%) for Age, grouped by Pclass.

📚 Module Summary

Barplot: Compares the **Mean** (Average) across categories.
Boxplot: Compares the **Distribution** (Median, Spread) across categories. Best for seeing outliers.
Violin Plot: Shows the **Density** (Shape of distribution).
Hue: Adds a third category (Multivariate analysis).
GroupBy: The Pandas method to get exact numerical summaries.

🤔 Interview Q&A


A Barplot is misleading when the data has outliers. By default a barplot shows only the mean, which is easily skewed by a few extreme values. A Boxplot shows the median and the spread, giving a truer picture of the distribution.
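A small numerical check illustrates the point. The data below is invented for illustration: two groups share the same typical fares, but group "B" contains a single extreme value.

```python
import pandas as pd

# Toy data, invented for illustration only: group "B" has one extreme fare.
toy = pd.DataFrame({
    "Group": ["A"] * 5 + ["B"] * 5,
    "Fare":  [10.0, 11.0, 12.0, 13.0, 14.0, 10.0, 11.0, 12.0, 13.0, 500.0],
})

# Mean is what a default barplot draws; median is the boxplot's center line.
summary = toy.groupby("Group")["Fare"].agg(["mean", "median"])
print(summary)
```

Both groups have a median of 12, yet the mean of group "B" jumps to 109.2 because of the one outlier. A barplot of the means would suggest the groups differ sharply, while the medians show that the typical value is the same.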

By default, sns.barplot() plots the **Mean (Average)** of the numerical variable (Y-axis) for each category (X-axis). You can change this using the estimator parameter.

You add a third categorical variable using the **hue** parameter (e.g., sns.boxplot(x='City', y='Sales', hue='Quarter', data=df)). This draws side-by-side boxes for every combination of the two categories.

.groupby().describe() provides the full statistical snapshot (count, mean, std, min, max, and all quartiles) for the grouped data. .mean() only provides one number (the average), which is insufficient for checking data spread and quality.
