Merging and Joining DataFrames

Combine and reshape data using keys and indices with Pandas.

Data Science Core Intermediate 60 min

🤝 Introduction: The Relational Data Model

In Data Science, information is often split across multiple tables to avoid redundancy (Normalization). For instance, customer details are separate from their monthly transactions. To get a complete view, you must link these tables.

**Merging** and **Joining** refer to combining DataFrames horizontally based on common column values (known as **keys**), similar to SQL operations.

**Concatenation** is different; it is primarily used for stacking DataFrames vertically (appending rows) or side-by-side (appending columns) without key-based alignment.

[Image of SQL join venn diagram]

🎯 Topic 1: The Core Merge Function (`pd.merge`)

The `pd.merge()` function is the primary tool for combining DataFrames based on values in shared columns. It offers full control over the relational join operation.

📝 Key `pd.merge()` Arguments:

                pd.merge(left_df, right_df,

                on='key_col',

                how='inner',

                suffixes=('_x', '_y'))

**1. `on`:** Single column name or a list of column names (e.g., `['ID', 'Date']`) to join on.
**2. `left_on`/`right_on`:** Used if the joining column names are different in the two DataFrames.
**3. `how`:** Specifies the type of join (Inner, Left, Right, Outer).

💻 Example: Inner Joinimportpandasaspd
df_A = pd.DataFrame({'ID': [1, 2], 'Dept': ['HR', 'IT']})
df_B = pd.DataFrame({'ID': [2, 3], 'Salary': [70000, 80000]})

inner_result = pd.merge(df_A, df_B, on='ID', how='inner')
print(inner_result)
            
# Output: Only ID 2 matches, so only that row is kept.
#    ID Dept  Salary
# 0   2   IT   70000

🔗 Topic 2: Understanding Join Types (`how`)

Choosing the correct join type is crucial as it determines which records are included in the final result and where missing values (`NaN`) are introduced.

📘 The Four Merging Styles:

Inner Join (`inner`)

Keeps the **intersection** of keys. Only matched rows from both sides are kept. (Default setting if 'how' is omitted).

Left Join (`left`)

Keeps **all rows from the left** DataFrame. Missing matches on the right side are filled with `NaN`.

Right Join (`right`)

Keeps **all rows from the right** DataFrame. Missing matches on the left side are filled with `NaN`.

Outer Join (`outer`)

Keeps the **union** of keys. All rows from both DataFrames are kept, filling unmatched columns with `NaN`.

🗂️ Topic 3: Index-Based Joining (`df.join()`)

The `DataFrame.join()` method is a simplified tool that is primarily designed for joining data based on the DataFrame's **index**. It's often cleaner and faster for cases where the index is the key.

Key `df.join()` Features:

**Join Key:** Defaults to joining the *right* DataFrame to the *left* DataFrame's **index**.
**`on` argument:** Allows you to specify a column in the *left* DataFrame to join to the *right* DataFrame's index.
**`lsuffix`/`rsuffix`:** Mandatory if non-index/key columns share the same name.

            💻 Example: Joining on Index
            df_base = pd.DataFrame({'Value': [10, 20, 30]}, index=['A', 'B', 'C'])

            df_meta = pd.DataFrame({'Desc': ['Good', 'Bad']}, index=['A', 'B'])

            joined_result = df_base.join(df_meta, how='left')

            print(joined_result)

↕️ Topic 4: Concatenation (`pd.concat()`)

Concatenation is used when combining DataFrames that share common columns (for `axis=0`) or common indices (for `axis=1`), without performing a relational check for matching key values across rows.

The `axis` Argument Controls Direction:

**`axis=0` (Default):** **Rows** are stacked vertically (e.g., combining Q1 data with Q2 data). Columns are aligned by name.
**`axis=1`:** **Columns** are lined up horizontally (e.g., adding a new feature). Rows are aligned by index.
**`join` argument:** Defaults to `join='outer'` (union of columns/indices). Use `join='inner'` to keep only common columns/indices.

💻 Example: Vertical Concatenationdf_Q1 = pd.DataFrame({'A': [1], 'B': [2]})
df_Q2 = pd.DataFrame({'A': [3], 'B': [4]})

stacked = pd.concat([df_Q1, df_Q2], ignore_index=True)
print(stacked)
            
# Output:
#    A  B
# 0  1  2
# 1  3  4

📚 Module Summary

`pd.merge()`: Primary method for relational joins on specific **key columns**.
`how` Argument: Defines the join logic (`inner`, `left`, `right`, `outer`).
`df.join()`: Simplifies joining the right DataFrame to the left DataFrame's **index**.
`pd.concat()`: Used for stacking rows (`axis=0`) or columns (`axis=1`), aligning based on index/column names, not keys.
Conflicting Names: `pd.merge()` adds `_x` and `_y` suffixes by default for non-key columns with the same name.

🤔 Interview Q&A

Tap on the questions below to reveal the answers.

**Merging (`pd.merge`)** combines based on key values (like `CustomerID`) to link related data horizontally. **Concatenation (`pd.concat`)** combines by position, stacking rows or aligning columns side-by-side without checking key relationships.

An **Outer Join** (how='outer') should be used when you need to keep **all** records from both DataFrames, regardless of whether a match exists. This is useful for identifying non-matched records or getting a full census of data.

You use the **`left_on`** and **`right_on`** arguments in `pd.merge()`. For example, if the key is `EmpID` in the left table and `WorkerID` in the right table: pd.merge(df_L, df_R, left_on='EmpID', right_on='WorkerID').

Pandas automatically handles the conflict by appending suffixes, usually _x for the left DataFrame and _y for the right DataFrame (e.g., a column named `Age` becomes `Age_x` and `Age_y`). These suffixes can be customized using the `suffixes` parameter.

Merging and Joining DataFrames

🤝 Introduction: The Relational Data Model

🎯 Topic 1: The Core Merge Function (`pd.merge`)

📝 Key `pd.merge()` Arguments:

💻 Example: Inner Join

🔗 Topic 2: Understanding Join Types (`how`)

📘 The Four Merging Styles:

🗂️ Topic 3: Index-Based Joining (`df.join()`)

Key `df.join()` Features:

💻 Example: Joining on Index

↕️ Topic 4: Concatenation (`pd.concat()`)

The `axis` Argument Controls Direction:

💻 Example: Vertical Concatenation

📚 Module Summary

🤔 Interview Q&A

Practice Question

Loading Question...

Output Console