Merging & Joining
Combine multiple DataFrames using SQL-style Joins and Concatenation.
π Introduction: The "SQL" of Pandas
In the real world, data is rarely sitting in one single perfect Excel sheet. You usually have:
1. **Customer Data** in one table.
2. **Order Data** in another table.
To analyze them together (e.g., "Which customer bought the most items?"), you need to bring them together. In SQL, you call this a **JOIN**. In Pandas, we call this **Merge** and **Concat**.
π Topic 1: Concatenation (`pd.concat`)
Concatenation is like stacking boxes. It doesn't care about matching "keys" or "IDs". It just glues DataFrames together.
π» Example: Stacking Data
π€ Topic 2: Merging (`pd.merge`)
Merging is smarter. It looks for a common "Key" (like Employee_ID) and links rows based on that key. This is exactly like **SQL JOIN**.
Keeps only rows where the key exists in **BOTH** tables. (Intersection).
Keeps **ALL** rows from both tables. Fills missing matches with `NaN`. (Union).
Keeps all rows from the **Left** table, matches from Right. (e.g., All Customers, even if no Orders).
Keeps all rows from the **Right** table, matches from Left.
π» Example: Merge Syntax
left_df,
right_df,
on='employee_id',
how='inner' # inner, outer, left, right
)
π Topic 3: Keys & Indicators
Sometimes your data is messy. The Employee ID might be called `emp_id` in one table and `id` in another.
- left_on / right_on: Use these if column names differ.
- suffixes: If both tables have a column named "Name", Pandas creates `Name_x` and `Name_y`. You can customize this.
- indicator=True: Adds a column called `_merge` that tells you if the row came from "left_only", "right_only", or "both". Very useful for debugging!
π Module Summary
- pd.concat: Glues tables together (Rows or Columns). Dumb but fast.
- pd.merge: Joins tables based on logic (Keys). Smart and powerful.
- Inner Join: Only matches.
- Left/Right Join: Prioritizes one table.
- Outer Join: Keeps everything.
π€ Interview Q&A
Tap on the questions below to reveal the answers.
Concat stacks DataFrames on top of each other or side-by-side based on index/axis. It doesn't look into the actual data content.
Merge combines DataFrames based on values in common columns (keys), similar to SQL JOINs.
Pandas automatically adds suffixes `_x` and `_y` to overlapping column names. You can control this using the `suffixes=('_left', '_right')` argument in `pd.merge()`.
A Cross Join (Cartesian Product) combines every row of the left table with every row of the right table. In Pandas, you can do this by assigning a temporary common key (like `key=1`) to both and merging, or using `how='cross'` in newer Pandas versions.