DSPython - Univariate Analysis

Univariate Analysis (One Variable)

Master the art of analyzing a single variable: Distributions, Counts, and Outliers.

Data Visualization | Beginner | 30 min

🔍 Introduction: "Uni" means One

Univariate Analysis is the first step in EDA (Exploratory Data Analysis). It involves analyzing data **one variable at a time**. It doesn't look at causes or relationships (like "does X affect Y?"); it just describes the data we have.

We divide variables into two types:

1. **Categorical:** (Species, City, Yes/No) -> We look at "Counts".
2. **Numerical:** (Age, Price, Height) -> We look at "Distribution" & "Spread".
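In pandas, this split usually follows the column dtypes. Here is a minimal sketch, assuming a small hypothetical DataFrame (the `toy` data and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two categorical columns, two numerical columns
toy = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "subscribed": ["Yes", "No", "Yes"],
    "age": [25, 31, 47],
    "price": [9.99, 14.50, 20.00],
})

# Columns with a numeric dtype (int, float) are treated as numerical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:  ", numerical_cols)
print("Categorical:", categorical_cols)
```

The dtype is only a first guess: a column of numeric codes (for example 0/1 flags) is stored as numbers but is usually better analyzed as categorical.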

📊 Topic 1: Categorical Data

For text or groups, we simply want to know: **How many items are in each group?**

✅ Key Tools:

**value_counts():** The Pandas text summary.
**sns.countplot():** The visual bar chart.

💻 Example: Counting Species

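import seaborn as sns
import matplotlib.pyplot as plt

# Assumption: df is seaborn's bundled "penguins" dataset (the original does not show how df is loaded)
df = sns.load_dataset("penguins")

print(df['species'].value_counts())

sns.countplot(x='species', data=df)
plt.title("Number of Penguins per Species")
plt.show()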
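Raw counts are often easier to compare as shares of the total. A minimal sketch, using a hypothetical toy column rather than the real penguin data, shows how `value_counts(normalize=True)` turns counts into fractions:

```python
import pandas as pd

# Hypothetical toy column (not the real penguin data)
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])

counts = species.value_counts()                # rows per group, largest first
shares = species.value_counts(normalize=True)  # same groups as fractions of the total

print(counts)
print(shares.round(2))
```

Fractions make it easier to spot an imbalanced column, where one group dominates the rest.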

📈 Topic 2: Numerical Distribution (`histplot`)

For numbers, we want to see the **Shape**. Is the data centered? Is it skewed (tilted) to one side?

💡 The Histogram & KDE:

**Histogram:** Groups numbers into "bins" and counts them.
**KDE (Kernel Density Estimate):** Draws a smooth curve over the histogram that estimates the probability density, so the overall shape is visible without depending on where the bin edges fall.
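To make the "bins" idea concrete, the sketch below counts values per bin by hand with `numpy.histogram`. The body masses are made-up numbers, and the choice of 4 bins is an assumption for illustration; `sns.histplot` normally picks the number of bins for you.

```python
import numpy as np

# Hypothetical body masses in grams (made up for illustration)
masses = np.array([3000, 3250, 3250, 3500, 3750, 3750, 3750, 4000, 4250, 5000])

# Split the range [min, max] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(masses, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} g: {n} values")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. The bar heights in a histogram are exactly these counts.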

[Image of a normal distribution curve]

💻 Example: Body Mass Distribution

sns.histplot(x='body_mass_g', data=df, kde=True)

# kde=True adds the smooth curve line.

📦 Topic 3: Outliers & Spread (`boxplot`)

Histograms show shape, while **Boxplots** show **Statistical Range**: the median, the quartiles, and how far the typical values extend. They are a standard tool for detecting **Outliers** (extreme values).

💻 Example: Checking for Extreme Weights

sns.boxplot(x='body_mass_g', data=df)

# Any dots outside the whiskers are outliers. By default, the whiskers reach at most 1.5 × IQR beyond the box.
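The dots on a boxplot can also be found numerically. This is a minimal sketch of the 1.5 × IQR rule, which matches seaborn's default `whis=1.5` setting; the body masses are hypothetical values chosen so that one of them is clearly extreme.

```python
import pandas as pd

# Hypothetical body masses in grams, with one extreme value
masses = pd.Series([3000, 3200, 3400, 3500, 3600, 3800, 4000, 9000])

q1 = masses.quantile(0.25)   # first quartile (25th percentile)
q3 = masses.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1                # interquartile range: width of the box

# Fences: values beyond these limits are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = masses[(masses < lower_fence) | (masses > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```

A flagged value is not automatically an error. It is a prompt to check whether the measurement is wrong or just unusual.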

📚 Module Summary

Categorical Data: Use value_counts() and sns.countplot().
Numerical Data: Use describe() for numbers, sns.histplot() for shape.
Outliers: Use sns.boxplot() to spot extreme values.
KDE: The smooth line that represents probability density.
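As a quick illustration of the `describe()` call mentioned in the summary, here is a minimal sketch on a hypothetical numerical column (the values are made up):

```python
import pandas as pd

# Hypothetical body masses in grams
masses = pd.Series([3000, 3200, 3400, 3500, 3600, 3800, 4000])

# count, mean, std, min, quartiles (25%, 50%, 75%) and max in one call
summary = masses.describe()
print(summary)
```

The `50%` row is the median, and the gap between the `25%` and `75%` rows is the IQR that a boxplot draws as its box.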

🤔 Interview Q&A


A **Bar Plot** is for Categorical data (gaps between bars). A **Histogram** is for Numerical data (bins touching each other) to show continuous frequency distribution.

**Skewness** measures asymmetry. **Right Skewed** means the tail extends to the right (typically Mean > Median). **Left Skewed** means the tail extends to the left (typically Mean < Median).
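The mean versus median comparison can be checked directly. This is a minimal sketch on made-up numbers; `Series.skew()` returns a positive value for a right tail and a negative value for a left tail.

```python
import pandas as pd

# Hypothetical data with one large value pulling the tail to the right
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
# Mirror image of the same data: the tail now points to the left
left_skewed = -right_skewed

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, skew={s.skew():.2f}")
```

The single large value drags the mean toward the tail while the median barely moves, which is why the two drift apart in skewed data.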

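Boxplots are used to identify **Outliers** and see the **Interquartile Range (IQR)** (the middle 50% of data). Because the box is built from the median and quartiles, it is a robust summary: a few extreme values barely move it, and they are drawn as separate points instead.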
