DSPython

Bivariate Analysis (Numeric vs. Numeric)

Learn to find relationships between two numerical variables.

Data Visualization · Beginner · 45 min

💡 Introduction: Data Synchronization

Bivariate Analysis is the study of how two variables interact. The key question is: "When one value changes, does the other reliably follow?"

We are looking for **Correlation**, a statistical measure of how strongly two variables are related in a linear fashion. Measuring it is a first step in deciding whether one variable can help predict another in a model.

The tools used here are scatterplots (which visualize individual points) and the **Correlation Matrix** (which quantifies the strength of each relationship).

🗺️ Topic 1: Visualizing Relationships: sns.scatterplot()

The **scatterplot** is the foundation of Bivariate analysis. It plots each row of data as a single point, showing the direct relationship between the X-axis variable and the Y-axis variable.

By looking at the cloud of dots, you can quickly see whether the relationship is **Positive** (dots trend up and to the right), **Negative** (dots trend down and to the right), or **None** (dots scattered with no clear direction).

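💻 Example: Age vs. Fare

import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be the Titanic DataFrame loaded earlier in the course
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title("Age vs. Fare Paid")
plt.show()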

💯 Topic 2: Measuring Strength: .corr()

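Visuals are great, but for a machine learning model, we need a number. **Correlation** is that number, ranging from **-1.0 to +1.0**. By default, the pandas .corr() method computes the Pearson correlation coefficient, which measures linear association.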

📈 Interpreting Correlation:

  • +1.0 (Perfect Positive): As X increases, Y increases along a straight line. Values close to +1.0 indicate a strong positive relationship (e.g., Study Hours vs. Score).
  • -1.0 (Perfect Negative): As X increases, Y decreases along a straight line. Values close to -1.0 indicate a strong negative relationship (e.g., Vehicle Age vs. Price).
  • 0.0 (No Linear Relation): No predictable linear pattern (e.g., Shoe Size vs. Salary).
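To make the three cases above concrete, here is a minimal sketch that computes Pearson's r from its definition and compares it with NumPy's np.corrcoef. The arrays are synthetic and the helper name pearson_r is an illustrative assumption, not part of the lesson's dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

rng = np.random.default_rng(0)      # fixed seed so the run is repeatable
x = np.arange(1.0, 11.0)            # 1, 2, ..., 10

r_pos = pearson_r(x, 3 * x + 2)     # perfectly increasing straight line
r_neg = pearson_r(x, -2 * x + 50)   # perfectly decreasing straight line

# Two independent noise columns: r should land close to 0, not exactly 0
noise_a = rng.normal(size=2000)
noise_b = rng.normal(size=2000)
r_noise = pearson_r(noise_a, noise_b)

# Cross-check the hand-written formula against NumPy
r_noise_numpy = float(np.corrcoef(noise_a, noise_b)[0, 1])

print(f"positive: {r_pos:+.2f}, negative: {r_neg:+.2f}, unrelated: {r_noise:+.2f}")
```

Real data rarely reaches exactly +1.0 or -1.0; those values only occur when every point falls on a single straight line.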

💻 Example: Calculating the Matrix

# Select ONLY numerical columns for the calculation
numerical_df = df[['Age', 'Fare', 'Survived']]
corr_matrix = numerical_df.corr()
print(corr_matrix)
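Listing the columns by hand works when you already know which ones are numeric. As a sketch of an alternative, select_dtypes(include="number") picks them automatically. The small DataFrame below is made-up illustrative data (not the real Titanic file); only its column names mirror the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mirrors the lesson's column names (not real Titanic rows)
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 8.46, 53.10, 51.86],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Keep only numeric columns; the text column 'Sex' is left out
numeric_df = toy_df.select_dtypes(include="number")

# .corr() defaults to Pearson and excludes missing values pair by pair
toy_corr = numeric_df.corr()
print(toy_corr.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, because every column is perfectly correlated with itself.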

🔥 Topic 3: Visualizing the Matrix: sns.heatmap()

When you have 10-20 numerical columns, reading the correlation matrix numbers is difficult. A **Heatmap** uses color intensity to instantly show the strength of every possible relationship.

💡 Heatmap Interpretation (with the 'coolwarm' colormap used below):

  • Deep Red: Strong positive correlation.
  • Deep Blue: Strong negative correlation.
  • Pale / Near-Gray: Weak or no correlation.

💻 Example: Generating the Heatmap

import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = df[['Age', 'Fare', 'Survived']].corr()
# annot=True prints the correlation value on each square.
# vmin=-1, vmax=1 pin the color scale so the neutral color sits at 0.
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.show()
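A heatmap shows the overall pattern at a glance. When you also want the strongest pairs as a ranked list, one option is to flatten the matrix. The sketch below uses synthetic columns (the names a to d are placeholders) and keeps only the upper triangle so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=n),   # strongly tied to a
    "c": -a + rng.normal(scale=2.0, size=n),      # weakly and negatively tied to a
    "d": rng.normal(size=n),                      # unrelated noise
})

corr = demo_df.corr()

# Boolean mask for the cells above the diagonal (k=1 excludes self-correlations)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() blanks the other cells to NaN, stack() makes a Series indexed by
# (row, column), and dropna() removes the blanked cells
pairs = corr.where(upper).stack().dropna()

# Sort by absolute value so strong negative pairs rank next to strong positive ones
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Sorting on the absolute value matters: a correlation of -0.9 is just as strong as +0.9, only in the opposite direction.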

🎨 Topic 4: Adding Hue (Bivariate to Multivariate)

We can make a Bivariate plot Multivariate by adding a **third categorical variable** using the **hue** parameter (color).

This lets us ask: *Is the relationship between Age and Fare different for men and women?*

💻 Example: Scatterplot with Hue

sns.scatterplot(x='Age', y='Fare', hue='Sex', data=df)
plt.title("Age vs. Fare, Colored by Sex")
plt.show()
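The hue plot answers the question visually. A numeric counterpart is to compute the correlation separately within each category. This is a sketch on synthetic data: the rows are generated rather than taken from the Titanic dataset, the rule linking Fare to Age is invented for illustration, and only the column names Age, Fare, and Sex follow the lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300
age = rng.uniform(18, 70, size=2 * n)
sex = np.repeat(["female", "male"], n)

# Invented rule (an assumption for illustration): fare rises with age in one group only
fare = np.where(sex == "female", 1.5 * age, 40.0) + rng.normal(scale=10, size=2 * n)

demo_df = pd.DataFrame({"Age": age, "Fare": fare, "Sex": sex})

# Correlation over all rows, ignoring the grouping
overall_r = demo_df["Age"].corr(demo_df["Fare"])

# One Age-Fare correlation per category of the hue variable
by_group = {
    group: part["Age"].corr(part["Fare"])
    for group, part in demo_df.groupby("Sex")
}

print(f"overall: {overall_r:.2f}")
for group, r in by_group.items():
    print(f"{group}: {r:.2f}")
```

When the per-group values differ a lot from the overall value, the single overall correlation is hiding a difference between the groups, which is exactly what the colored scatterplot is meant to reveal.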

📚 Module Summary

  • Scatter Plot: Visualizes the relationship between two numerical variables (X and Y).
  • Correlation: A single number (-1 to +1) measuring the strength and direction of the linear relationship.
  • Heatmap: Visualizes the entire correlation matrix using color intensity.
  • Hue: Parameter used to add a third categorical variable to a plot.

🤔 Interview Q&A


A **Scatter Plot** shows individual data points and is used to find correlation between non-sequential variables. A **Line Plot** connects points in order and is typically used to show trends over time (Time Series Data).

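A correlation of 0.0 means there is **no linear relationship**. A feature (column) with a correlation near 0.0 against the target adds little to a linear model on its own, so it is a candidate for removal. Check for a non-linear pattern (for example, with a scatterplot) before dropping it, because correlation only detects straight-line relationships.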
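Zero correlation rules out only a straight-line pattern, not every kind of relationship. The sketch below uses a synthetic U-shaped example in which y is fully determined by x, yet the Pearson correlation is essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # values placed symmetrically around zero
y = x ** 2                    # y depends entirely on x, but not linearly

# Pearson correlation between x and y: the rising and falling halves cancel out
r_linear = np.corrcoef(x, y)[0, 1]

# After squaring x, the relationship becomes a straight line again
r_transformed = np.corrcoef(x ** 2, y)[0, 1]

print(f"corr(x, y)   = {r_linear:.3f}")
print(f"corr(x^2, y) = {r_transformed:.3f}")
```

This is why a scatterplot is worth a look before discarding a feature on the basis of its correlation alone.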

A Heatmap provides **visual immediacy**. You can instantly spot the strongest and weakest correlations (intense colors vs. pale colors) across dozens of variables without having to carefully read and compare every single number in the matrix.

No, correlation does not imply causation. Correlation only shows that two variables change together, while causation means one variable *causes* the other. Example: ice cream sales and drowning incidents might be highly correlated, but both are driven by a third factor, the summer heat.
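The ice cream example can be simulated. In the sketch below, all numbers are synthetic and the model is an assumption made for illustration: temperature drives both quantities, and neither one influences the other. The raw correlation is still high, and it largely disappears once temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Assumed synthetic model: temperature is the only common driver
temperature = rng.normal(loc=25, scale=5, size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(scale=5, size=n)
drownings = 0.5 * temperature + rng.normal(scale=2, size=n)

# Correlation between the two outcomes, ignoring temperature
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlate what remains once temperature is removed (a partial correlation)
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:             {raw_r:.2f}")
print(f"controlling for temperature: {partial_r:.2f}")
```

In real data the common driver is usually not known in advance, so a strong correlation should be read as a prompt to investigate rather than as proof of cause.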
