DSPython Logo DSPython

Merging & Joining

Combine multiple DataFrames using SQL-style Joins and Concatenation.

Data Science Intermediate 50 min

πŸ”— Introduction: The "SQL" of Pandas

In the real world, data is rarely sitting in one single perfect Excel sheet. You usually have:
1. **Customer Data** in one table.
2. **Order Data** in another table.

To analyze them together (e.g., "Which customer bought the most items?"), you need to bring them together. In SQL, you call this a **JOIN**. In Pandas, we call this **Merge** and **Concat**.

πŸ“š Topic 1: Concatenation (`pd.concat`)

Concatenation is like stacking boxes. It doesn't care about matching "keys" or "IDs". It just glues DataFrames together.

πŸ’» Example: Stacking Data

# Stacking vertically (Adding more rows)
df_combined = pd.concat([df1, df2], axis=0)

# Stacking horizontally (Adding more columns)
df_wide = pd.concat([df1, df2], axis=1)

🀝 Topic 2: Merging (`pd.merge`)

Merging is smarter. It looks for a common "Key" (like Employee_ID) and links rows based on that key. This is exactly like **SQL JOIN**.

Inner Join (Default)

Keeps only rows where the key exists in **BOTH** tables. (Intersection).

Outer Join

Keeps **ALL** rows from both tables. Fills missing matches with `NaN`. (Union).

Left Join

Keeps all rows from the **Left** table, matches from Right. (e.g., All Customers, even if no Orders).

Right Join

Keeps all rows from the **Right** table, matches from Left.

πŸ’» Example: Merge Syntax

# Merging on a common column 'key'
result = pd.merge(
    left_df,
    right_df,
    on='employee_id',
    how='inner' # inner, outer, left, right
)

πŸ”‘ Topic 3: Keys & Indicators

Sometimes your data is messy. The Employee ID might be called `emp_id` in one table and `id` in another.

  • left_on / right_on: Use these if column names differ.
  • suffixes: If both tables have a column named "Name", Pandas creates `Name_x` and `Name_y`. You can customize this.
  • indicator=True: Adds a column called `_merge` that tells you if the row came from "left_only", "right_only", or "both". Very useful for debugging!

πŸ“š Module Summary

  • pd.concat: Glues tables together (Rows or Columns). Dumb but fast.
  • pd.merge: Joins tables based on logic (Keys). Smart and powerful.
  • Inner Join: Only matches.
  • Left/Right Join: Prioritizes one table.
  • Outer Join: Keeps everything.

πŸ€” Interview Q&A

Tap on the questions below to reveal the answers.

Concat stacks DataFrames on top of each other or side-by-side based on index/axis. It doesn't look into the actual data content.
Merge combines DataFrames based on values in common columns (keys), similar to SQL JOINs.

Pandas automatically adds suffixes `_x` and `_y` to overlapping column names. You can control this using the `suffixes=('_left', '_right')` argument in `pd.merge()`.

A Cross Join (Cartesian Product) combines every row of the left table with every row of the right table. In Pandas, you can do this by assigning a temporary common key (like `key=1`) to both and merging, or using `how='cross'` in newer Pandas versions.

πŸ€–
DSPython AI Assistant βœ–
πŸ‘‹ Hi! I’m your AI assistant. Paste your code here, I will find bugs for you.