DSPython

Univariate Analysis

Analyze, summarize, and visualize single variables using Pandas and Seaborn.

Data Visualization · Beginner · 45 min

Introduction: The Individual Player Report

Univariate analysis is the usual starting point for exploratory data analysis (EDA). It means examining a dataset **one variable (column) at a time**.

Imagine you are reviewing a cricket team's performance. Univariate analysis means checking one player's stats (e.g., just his runs, or just his wickets) before comparing him to the rest of the team. The goal is to describe the single variable's distribution and find any issues (like outliers).

[Image of data cleaning process flowchart]

Topic 1: Analyzing Numerical Variables (Age, Price)

For numerical data, we use statistics and plots to understand three key characteristics:

Statistical Goals:

  • Central Tendency: Mean and Median (what is the typical value?). The **Median** is the more robust choice when the data has outliers.
  • Spread: Standard Deviation and IQR (how varied is the data?).
  • Distribution: Skewness (is the data piled up on one side?).
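
As a minimal sketch of the three goals above, each statistic can be computed directly on a pandas Series. The `ages` sample reuses the values from this lesson's example; the variable names are illustrative only.

```python
import pandas as pd

# Sample ages; 85 is an extreme value compared to the rest.
ages = pd.Series([22, 25, 30, 30, 35, 40, 50, 85])

# Central tendency
mean = ages.mean()
median = ages.median()

# Spread
std = ages.std()  # sample standard deviation (ddof=1 by default)
iqr = ages.quantile(0.75) - ages.quantile(0.25)

# Distribution shape
skewness = ages.skew()

print("Mean:", mean, "Median:", median)
print("Std:", round(std, 2), "IQR:", iqr)
print("Skewness:", round(skewness, 2))
```

The mean lands well above the median here because the single value 85 pulls it upward, which is why the median is the safer summary for this column.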

Example: The describe() Command

import pandas as pd

data = {'age': [22, 25, 30, 30, 35, 40, 50, 85]}
df_example = pd.DataFrame(data)

# Summary statistics for the single numerical column
print(df_example['age'].describe())
# .describe() output includes count, mean, std, min, max, and quartiles.

Visualization Tools:

  • **Histogram (sns.histplot):** Shows the **shape** and **frequency** of the distribution. Essential for checking skewness.
  • **Boxplot (sns.boxplot):** Shows the quartiles and **outliers**. Essential for identifying extreme values.
[Image of a boxplot chart showing quartiles and outlier points]
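
The points a boxplot draws as outliers usually follow the 1.5 × IQR rule, which is also seaborn's default whisker length. The sketch below applies that rule numerically to the same age sample; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([22, 25, 30, 30, 35, 40, 50, 85])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```

Only the value 85 falls outside the fences, so it is the one point a boxplot of this column would show separately.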

Topic 2: Analyzing Categorical Variables (City, Gender)

Categorical data represents groups or labels rather than measured quantities. Here we are mainly interested in the **counts** and **proportions** of each group.

Analytical Goal:

Find the **Mode** (most frequent category) and the **Frequency** (count of each category).

Example: Counting Frequencies

import pandas as pd

data = {'gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df_example = pd.DataFrame(data)

# Count how many times each category appears
print(df_example['gender'].value_counts())
# .value_counts() is the primary tool for categorical analysis.
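
Proportions and the mode can be read off in much the same way. This is a small sketch on the same gender sample, using `value_counts(normalize=True)` and `mode()`; the variable names are illustrative.

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male', 'Male', 'Female'], name='gender')

# Share of each category (fractions that sum to 1)
proportions = gender.value_counts(normalize=True)

# Most frequent category; mode() returns a Series, so take the first entry
most_common = gender.mode()[0]

print(proportions)
print("Mode:", most_common)
```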

Visualization Tools:

**Count Plot (sns.countplot):** A bar chart that counts the rows in each category for you, so there is no need to call .value_counts() first.

import seaborn as sns
sns.countplot(data=df_example, x='gender')

🧠 Topic 3: Advanced Shapes: Skewness & Kurtosis

These two measures provide numerical values that describe the **shape** of your numerical distribution:

1. Skewness:

Measures the **asymmetry** of the distribution. If the tail is pulled to the right (positive skew), the Mean is typically greater than the Median. If the tail is pulled to the left (negative skew), the Mean is typically less than the Median.
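
To make the mean-versus-median relationship concrete, the sketch below compares the age sample with its mirror image. The mirrored series is constructed only for illustration and is not real data.

```python
import pandas as pd

# Right-skewed sample: one large value stretches the right tail.
right = pd.Series([22, 25, 30, 30, 35, 40, 50, 85])

# Mirror image over the same range, so the long tail points left instead.
left = (right.max() + right.min()) - right

for name, s in [("right-skewed", right), ("left-skewed", left)]:
    print(name, "| skew:", round(s.skew(), 2),
          "| mean:", s.mean(), "| median:", s.median())
```

The first series has positive skew with mean above median; the mirrored one flips both the sign of the skew and the order of mean and median.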

2. Kurtosis:

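Measures the **tailedness** of the distribution (how heavy or light the tails are) compared to a normal distribution. It is often described as **peakedness** too, but it is driven mainly by extreme values in the tails. Note that pandas' .kurt() reports excess kurtosis, so a normal distribution scores about 0.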
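
One way to see how strongly tail values drive kurtosis is to compute it with and without the most extreme observation. This sketch reuses the age sample; dropping 85 is done purely for comparison, not as a recommended cleaning step.

```python
import pandas as pd

ages = pd.Series([22, 25, 30, 30, 35, 40, 50, 85])

# Same sample with the single extreme value removed
ages_trimmed = ages[ages < 85]

kurt_all = ages.kurt()              # excess kurtosis, extreme value included
kurt_trimmed = ages_trimmed.kurt()  # excess kurtosis, extreme value removed

print("Kurtosis with 85:   ", round(kurt_all, 2))
print("Kurtosis without 85:", round(kurt_trimmed, 2))
```

The value drops sharply once 85 is removed, which shows that a single far-out observation can account for most of the measured tail weight in a small sample.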

Example: Calculating Shape

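import pandas as pd
df_example = pd.DataFrame({'age': [22, 25, 30, 30, 35, 40, 50, 85]})
# pandas Series methods that calculate the shape measures
print("Skewness:", df_example['age'].skew())
print("Kurtosis:", df_example['age'].kurt())  # excess kurtosis (normal = 0)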

Module Summary

  • Numerical: Look for Mean, Median, Spread (STD/IQR). Plot with Histograms and Boxplots.
  • Categorical: Look for Frequencies and Mode. Plot with Count Plots.
  • Key Command: .describe() for numbers, .value_counts() for categories.

Interview Q&A


In a skewed distribution (e.g., highly positive skew), the Mean is pulled towards the long tail and is higher than the Median. The Median is considered a better, more robust measure of central tendency because it is much less affected by outliers.

Use a Histogram for continuous numerical data (e.g., Age) to show the distribution/shape. Use a Bar Plot for categorical data (e.g., City) to show the distinct counts of each group.

If skewness is close to 0, the distribution is roughly symmetrical (a bell curve is one example). If skewness is positive (e.g., +2), the data is skewed right, with a long tail of high values.

df.describe() summarizes numeric columns by default (mean, standard deviation, quartiles), and those statistics are meaningless for text data. Called on a categorical column, it reports only count, unique, top (the mode), and freq (the mode's count). You need .value_counts() on the column to get the full frequency table.
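
The difference is easy to see side by side. The sketch below uses a made-up `city` column (the values are hypothetical, chosen only for illustration) and prints both summaries.

```python
import pandas as pd

# Hypothetical categorical column for illustration
city = pd.Series(['Pune', 'Delhi', 'Pune', 'Mumbai', 'Pune', 'Delhi'], name='city')

# describe() on text data: only count, unique, top, and freq
summary = city.describe()
print(summary)

# value_counts(): the full frequency table, one row per category
counts = city.value_counts()
print(counts)
```

describe() names only the most frequent city and its count, while value_counts() lists every city, which is what a frequency analysis needs.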
