How to Merge DataFrames in Pandas

Priya Patel

1 month ago

Merging DataFrames in Pandas is a core operation for combining datasets, akin to SQL JOINs. Use pd.merge(left_df, right_df, on='common_key', how='inner') to efficiently combine two DataFrames. The on parameter specifies the key column(s), and how dictates the merge type (e.g., ‘inner’, ‘left’, ‘right’, ‘outer’).

Metric	Details
Pandas Versions	`pd.merge()` is a long-standing, stable API. The `indicator` parameter was added in Pandas 0.17.0, and minor enhancements and bug fixes continue to land.
Time Complexity	Typically O(N+M) for hash-based merges (default), where N and M are the number of rows in the left and right DataFrames respectively. The cost is higher when keys are duplicated on both sides, because the result itself can approach N×M rows.
Space Complexity	O(K) where K is the number of rows in the result DataFrame. Additional temporary space for hash tables is also used, proportional to the keys being merged.
Key Parameters	`on`, `left_on`, `right_on`, `how`, `suffixes`, `indicator`.
Memory Footprint	Can be substantial for large DataFrames, especially with ‘outer’ merges that generate a union of all keys, or if a Cartesian product is unintentionally created.
Typical Use Cases	Combining relational data, enriching datasets, performing SQL-like joins between tables.

The Senior Dev Hook

When I first moved from SQL to Python for data manipulation, I found myself constantly reaching for pd.merge. It’s the workhorse for combining data. However, early on, I learned the hard way that not understanding the subtle differences between how='left' and how='inner' could lead to silently losing critical data points or, conversely, bloating my dataset with unintended duplicates. Always start with a clear understanding of your join type and key columns; that precision saves hours of debugging down the line.

Under the Hood Logic: How Pandas Merges DataFrames

At its core, pd.merge() is Pandas’ implementation of database-style join operations. It combines two DataFrames based on common columns or indices. The mechanism relies primarily on hash tables for efficiency, similar to the hash joins used by relational databases.

Here’s a breakdown of the common how types:

how='inner' (default): This produces only the rows where the merge key exists in both the left and right DataFrames. It’s like finding the intersection of your datasets based on the key.
how='left': This includes all rows from the left DataFrame and matching rows from the right DataFrame. If a key in the left DataFrame doesn’t have a match in the right, the columns from the right DataFrame will be filled with NaN.
how='right': The inverse of a left merge. It includes all rows from the right DataFrame and matching rows from the left DataFrame. Unmatched keys from the right will have NaN in the left DataFrame’s columns.
how='outer': This produces a union of keys, including all rows from both DataFrames. Where there’s no match, NaN is used to fill in the missing values. This can significantly increase the number of rows if many keys are unique to each DataFrame.

When you specify key columns using on, the merge behaves like a classic hash join: conceptually, a hash table is built from the merge keys of one DataFrame (ideally the smaller one), and the other DataFrame is then scanned, probing the hash table for matches. This hash-based approach is why pd.merge() achieves its roughly linear average-case time complexity.
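The build-and-probe idea described above can be sketched in plain Python. This is a conceptual illustration only, not Pandas' actual implementation; the function name hash_inner_join and the sample rows are made up for the example, and for simplicity it always builds the table from the right-hand input rather than picking the smaller one.

```python
from collections import defaultdict


def hash_inner_join(left_rows, right_rows, key):
    """Conceptual inner join over lists of dicts (not pandas' real code)."""
    # Build phase: index the right-hand rows by their key value.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    # Probe phase: look up each left-hand row's key in the index.
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined


employees = [
    {"name": "Alice", "department_id": 101},
    {"name": "Bob", "department_id": 102},
    {"name": "Zed", "department_id": 999},  # hypothetical row with no match
]
departments = [
    {"department_id": 101, "department_name": "Engineering"},
    {"department_id": 102, "department_name": "Marketing"},
]

result = hash_inner_join(employees, departments, "department_id")
for row in result:
    print(row)
```

Each input is traversed once, which is where the O(N+M) estimate comes from; the inner loop only adds work when a key has several matches.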

Step-by-Step Implementation

Let’s walk through combining two DataFrames using various merge strategies. First, we need some sample data.


import pandas as pd

# DataFrame 1: Employee information
df_employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'department_id': [101, 102, 101, 103, 102]
})

# DataFrame 2: Department information
df_departments = pd.DataFrame({
    'department_id': [101, 102, 103, 104],
    'department_name': ['Engineering', 'Marketing', 'Sales', 'HR']
})

print("df_employees:")
print(df_employees)
print("\ndf_departments:")
print(df_departments)

Output:


df_employees:
   employee_id     name  department_id
0            1    Alice            101
1            2      Bob            102
2            3  Charlie            101
3            4    David            103
4            5      Eve            102

df_departments:
   department_id department_name
0            101     Engineering
1            102       Marketing
2            103         Sales
3            104            HR

1. Inner Merge (Default Behavior)

An inner merge keeps only the rows where the merge key (department_id) is present in both DataFrames.


# Inner merge on 'department_id'
merged_inner = pd.merge(df_employees, df_departments, on='department_id', how='inner')
print("\nInner Merge (merged_inner):")
print(merged_inner)

Explanation: Employee ‘David’ (department_id 103) is matched with ‘Sales’. No employee is in ‘HR’ (department_id 104), so ‘HR’ department is not included. Similarly, if an employee had a department_id not present in df_departments, they would be excluded.

2. Left Merge

A left merge retains all rows from the left DataFrame (df_employees) and adds matching information from the right DataFrame. If a department_id from df_employees has no match in df_departments, NaN will fill the department_name column for that row.


# Left merge on 'department_id'
merged_left = pd.merge(df_employees, df_departments, on='department_id', how='left')
print("\nLeft Merge (merged_left):")
print(merged_left)

Explanation: In this specific example, all employee department IDs have a match. If we added an employee with department_id=999, that row would still appear, but department_name would be NaN.
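To make the unmatched case concrete, here is a small sketch with an extra employee whose department_id is 999. The row for "Frank" and the trimmed-down tables are invented for this illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [1, 2, 6],
    "name": ["Alice", "Bob", "Frank"],   # Frank is a made-up extra row
    "department_id": [101, 102, 999],    # 999 has no department record
})
departments = pd.DataFrame({
    "department_id": [101, 102],
    "department_name": ["Engineering", "Marketing"],
})

left = pd.merge(employees, departments, on="department_id", how="left")
inner = pd.merge(employees, departments, on="department_id", how="inner")

print(left)                    # Frank is kept, department_name is NaN
print(len(left), len(inner))   # the inner merge drops Frank
```

Comparing the two row counts is a quick way to see how many left-hand rows have no match.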

3. Right Merge

A right merge retains all rows from the right DataFrame (df_departments) and adds matching information from the left DataFrame. If a department_id from df_departments has no matching employee, NaN will fill the employee-related columns.


# Right merge on 'department_id'
merged_right = pd.merge(df_employees, df_departments, on='department_id', how='right')
print("\nRight Merge (merged_right):")
print(merged_right)

Explanation: The ‘HR’ department (ID 104) is included because it’s in df_departments. Since no employee is linked to HR, employee_id and name are NaN for that row.
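One side effect worth knowing: when NaN is introduced into an integer column, Pandas upcasts that column to float. The sketch below uses small stand-in tables and shows one possible remedy, the nullable "Int64" dtype; whether you want that conversion depends on your downstream code.

```python
import pandas as pd

employees = pd.DataFrame({"employee_id": [1, 2], "department_id": [101, 102]})
departments = pd.DataFrame({"department_id": [101, 102, 104]})

right = pd.merge(employees, departments, on="department_id", how="right")
raw_dtype = right["employee_id"].dtype
print(raw_dtype)  # float64: the missing value forced an upcast

# Optional: switch to the nullable integer dtype so IDs print as 1, 2, <NA>
as_nullable = right["employee_id"].astype("Int64")
print(as_nullable)
```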

4. Outer Merge

An outer merge combines all rows from both DataFrames, creating a union of keys. NaN values are inserted wherever a match is not found.


# Outer merge on 'department_id'
merged_outer = pd.merge(df_employees, df_departments, on='department_id', how='outer')
print("\nOuter Merge (merged_outer):")
print(merged_outer)

Explanation: All employees and all departments are present. The ‘HR’ department has NaN for employee details, and all employees have a department name. This is useful for comprehensive analysis where you need to see every key from both DataFrames, matched or not.
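With an outer merge it is often unclear which side each row came from. A minimal sketch using the indicator parameter, with invented two-row tables, might look like this:

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Zed"],          # Zed is a hypothetical employee
    "department_id": [101, 999],
})
departments = pd.DataFrame({
    "department_id": [101, 104],
    "department_name": ["Engineering", "HR"],
})

outer = pd.merge(
    employees, departments,
    on="department_id", how="outer",
    indicator=True,   # adds a '_merge' column: left_only / right_only / both
)
counts = outer["_merge"].value_counts()
print(outer)
print(counts)
```

Rows tagged left_only or right_only are exactly the ones an inner merge would have dropped.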

5. Merging on Different Key Column Names

Sometimes the join keys have different names in the two DataFrames. Use left_on and right_on.


df_projects = pd.DataFrame({
    'project_id': [1001, 1002, 1003],
    'manager_id': [1, 3, 2], # Corresponds to employee_id in df_employees
    'project_name': ['Alpha', 'Beta', 'Gamma']
})

# Merge df_employees with df_projects using different key names
merged_projects = pd.merge(df_employees, df_projects, left_on='employee_id', right_on='manager_id', how='inner')
print("\nMerged with different key names (merged_projects):")
print(merged_projects)
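Note that with left_on and right_on, both key columns are kept in the result even though they hold the same values. A short sketch with cut-down versions of the tables shows this and one way to drop the redundant column:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
projects = pd.DataFrame({
    "project_id": [1001, 1002],
    "manager_id": [1, 3],
    "project_name": ["Alpha", "Beta"],
})

merged = pd.merge(
    employees, projects,
    left_on="employee_id", right_on="manager_id", how="inner",
)
print(list(merged.columns))  # both employee_id and manager_id are present

# The two key columns are identical after an inner merge, so one can go.
tidy = merged.drop(columns="manager_id")
print(tidy)
```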

6. Handling Overlapping Columns with `suffixes`

If both DataFrames have columns with the same name (other than the merge key), Pandas will automatically append _x and _y. You can control this behavior with the suffixes parameter.


df_performance = pd.DataFrame({
    'employee_id': [1, 2, 3, 6], # Note: employee_id 6 is new
    'score': [90, 85, 92, 78],
    'quarter': ['Q1', 'Q1', 'Q1', 'Q1']
})

# Left merge with the default suffixes ('_x', '_y').
# No non-key column names overlap here, so no suffix is applied.
merged_overlap_default = pd.merge(df_employees, df_performance, on='employee_id', how='left')
print("\nLeft merge (default suffixes):")
print(merged_overlap_default)

# Merge with custom suffixes
merged_overlap_custom = pd.merge(
    df_employees, df_performance,
    on='employee_id',
    how='left',
    suffixes=('_info', '_perf') # Only used if a non-key column name appears in both DataFrames
)
print("\nMerged with custom suffixes:")
print(merged_overlap_custom)

Explanation: Suffixes are applied only to non-key columns whose names appear in both DataFrames. df_employees has no score or quarter column, so nothing overlaps here: both merges return identical columns, and score keeps its original name rather than becoming score_perf. Had df_employees also contained a score column, the two copies would be named score_info and score_perf. Because this is a left merge, employee 6 (present only in df_performance) does not appear in the result, while employees 4 and 5, who have no performance record, get NaN for score and quarter.
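To see suffixes actually take effect, both inputs need a non-key column with the same name. The sketch below uses two invented quarterly score tables (q1 and q2 are hypothetical names) that share a score column:

```python
import pandas as pd

q1 = pd.DataFrame({"employee_id": [1, 2, 3], "score": [90, 85, 92]})
q2 = pd.DataFrame({"employee_id": [1, 2, 4], "score": [88, 91, 79]})

# Default behaviour: the clashing column becomes score_x / score_y.
default = pd.merge(q1, q2, on="employee_id", how="inner")
print(list(default.columns))

# Custom suffixes make the origin of each column obvious.
custom = pd.merge(q1, q2, on="employee_id", how="inner", suffixes=("_q1", "_q2"))
print(list(custom.columns))
```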

What Can Go Wrong (Troubleshooting)

Merging DataFrames, while powerful, comes with its own set of common pitfalls:

KeyError for on/left_on/right_on: This occurs if the column you specify for merging does not exist in the respective DataFrame. Always double-check column names, including case sensitivity.
Unexpected Number of Rows:
- Too Few Rows: Often a sign of an inner merge when you intended a left or outer. It means many keys did not match.
- Too Many Rows: This is dangerous and usually indicates duplicate keys in one or both DataFrames without a clear many-to-many relationship being intended. A key duplicated on one side repeats the matching row from the other side; a key duplicated on both sides produces a Cartesian product of those duplicates. For example, if df_employees had two entries for employee_id=1, and df_performance also had two entries for employee_id=1, an inner merge would produce four rows for employee_id=1.
Silent Data Loss/Discrepancies: If you use an inner merge and some keys only exist in one DataFrame, those rows are silently dropped. Always be explicit with your how parameter and verify your results.
NaN Values in Critical Columns: This usually means a key existed in one DataFrame but not the other for an outer, left, or right merge. Verify if this is expected.
Type Mismatches in Merge Keys: Merging on columns with fundamentally different data types (e.g., string vs. integer representations of IDs) raises a ValueError in recent Pandas versions, and subtler mismatches (such as '001' vs. '1' stored as strings) simply produce no matches. Always ensure your merge keys have a consistent dtype, converting with df['column'].astype(str) or .astype(int) if necessary.
Memory Exhaustion (Out of Memory Errors): Merging very large DataFrames, especially with outer joins, can quickly consume available RAM. This is particularly true if the intermediate hash tables or the resulting DataFrame exceed system memory.
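Two of the pitfalls above, duplicate keys and mismatched key dtypes, can be reproduced and guarded against in a few lines. This is a sketch with invented data; the validate parameter is part of pd.merge, but which relationship you assert ("one_to_one", "many_to_one", and so on) is an assumption about your own data.

```python
import pandas as pd

left = pd.DataFrame({"employee_id": [1, 1, 2], "note": ["a", "b", "c"]})
right = pd.DataFrame({"employee_id": [1, 1, 2], "score": [90, 95, 85]})

# Key 1 appears twice on each side, so it yields 2 x 2 = 4 rows.
blown_up = pd.merge(left, right, on="employee_id", how="inner")
print(len(blown_up))

# validate= turns the silent blow-up into an explicit error.
caught = False
try:
    pd.merge(left, right, on="employee_id", validate="one_to_one")
except pd.errors.MergeError as exc:
    caught = True
    print("validate caught it:", exc)

# Keys stored as strings on one side: align the dtype before merging.
right_str = right.assign(employee_id=right["employee_id"].astype(str))
right_fixed = right_str.astype({"employee_id": "int64"})
fixed = pd.merge(left, right_fixed, on="employee_id", how="inner")
print(len(fixed))
```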

Performance & Best Practices

Knowing when and how to use pd.merge optimally is key to efficient data processing.

When NOT to Use `pd.merge`

Simple Concatenation (Stacking): If you simply want to stack DataFrames vertically (add rows) or horizontally (add columns, assuming indices align), use pd.concat(). It’s designed for these “union” operations and is often faster for such cases.
Index-Based Joins: If your intention is to join on the DataFrame indices, the .join() method (e.g., df1.join(df2)) is more concise than specifying left_index=True, right_index=True in pd.merge().
Data Too Large for Memory: For datasets that exceed your system’s RAM, pd.merge() will typically fail with a MemoryError or push the machine into heavy swapping. Consider distributed computing frameworks like Dask DataFrames or PySpark DataFrames, which offer similar merge functionality but process data in partitions, on disk or across clusters.

Alternative Methods and Comparison

DataFrame.join(): This method combines DataFrames by index, or matches key column(s) of the left (calling) DataFrame against the index of the right DataFrame via the on argument. It’s often cleaner for index-based joins and performs a left join by default.


# Example: df_employees.join(df_departments.set_index('department_id'), on='department_id')
# This joins df_employees (left) with df_departments (right) based on department_id
# where department_id in df_employees matches the index of df_departments.
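As a runnable version of the commented example, the sketch below runs the same left join once with .join() and once with pd.merge() on small stand-in tables, so the two results can be compared side by side.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "department_id": [101, 102],
})
departments = pd.DataFrame({
    "department_id": [101, 102, 104],
    "department_name": ["Engineering", "Marketing", "HR"],
})

# .join(): left column 'department_id' is matched against the right index.
via_join = employees.join(departments.set_index("department_id"), on="department_id")

# Equivalent left merge on the shared column.
via_merge = pd.merge(employees, departments, on="department_id", how="left")

print(via_join)
print(via_merge)
```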

pd.concat(): For simply stacking rows or columns. It doesn’t perform intelligent key-based matching like merge.


# Example: pd.concat([df1, df2], axis=0) # Stacks rows
# Example: pd.concat([df1, df2], axis=1) # Stacks columns, requires index alignment
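A runnable sketch of both concat directions, using two invented monthly tables (jan and feb are hypothetical names), highlights that no key matching takes place:

```python
import pandas as pd

jan = pd.DataFrame({"employee_id": [1, 2], "score": [90, 85]})
feb = pd.DataFrame({"employee_id": [1, 3], "score": [88, 92]})

# axis=0: rows are appended; ignore_index renumbers the result 0..3.
stacked = pd.concat([jan, feb], axis=0, ignore_index=True)
print(stacked)

# axis=1: columns are placed side by side, aligned on the row index only.
# Row 1 pairs employee 2 with employee 3, since keys are never compared.
side_by_side = pd.concat([jan, feb.add_suffix("_feb")], axis=1)
print(side_by_side)
```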

Best Practices for Merging

Understand Your Keys: Ensure your merge keys are unique in at least one DataFrame (if possible) to prevent unintended Cartesian products. If not unique, be aware of how duplicates will affect the result.
Check dtype Consistency: Before merging, always check that the data types of your merge keys are consistent across both DataFrames using df.dtypes. If not, use .astype() to convert them.
Specify how Explicitly: Never rely on the default how='inner' unless it’s explicitly your intention. Make your merge type clear to avoid subtle data loss.
Use suffixes for Overlapping Columns: Always define suffixes if there’s a chance of non-merge columns having the same name. This improves readability and prevents default _x/_y suffixes, which can be confusing.
Leverage indicator=True for Debugging: For complex merges, adding indicator=True to pd.merge() will add a special column (_merge by default) to the output indicating where each row came from (‘left_only’, ‘right_only’, ‘both’). This is invaluable for verifying join correctness.
Filter Before Merging: If you only need a subset of data from either DataFrame, filter it *before* the merge operation. This reduces the size of DataFrames involved in the merge, saving memory and improving performance.
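Several of these practices can be combined into a short routine. The following is one possible sketch rather than a prescribed recipe; the floor column and the choice of validate="many_to_one" are assumptions made for the example.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department_id": [101, 102, 101, 103],
})
departments = pd.DataFrame({
    "department_id": [101, 102, 103, 104],
    "department_name": ["Engineering", "Marketing", "Sales", "HR"],
    "floor": [3, 2, 1, 4],   # hypothetical extra column we do not need
})

# Filter rows and columns before merging to keep the inputs small.
eng_only = employees[employees["department_id"] == 101]
dept_names = departments[["department_id", "department_name"]]

# Check that the key dtypes agree before merging.
assert eng_only["department_id"].dtype == dept_names["department_id"].dtype

merged = pd.merge(
    eng_only, dept_names,
    on="department_id",
    how="left",                # explicit join type
    validate="many_to_one",    # each department_id must be unique on the right
)

# A left merge against unique right-hand keys must preserve the row count.
assert len(merged) == len(eng_only)
print(merged)
```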

Author’s Final Verdict

pd.merge() is a cornerstone of data wrangling in Python, offering powerful and flexible ways to combine disparate datasets. In my experience, mastering the different join types (inner, left, right, outer) and understanding how key columns, overlapping names, and data types influence the outcome is paramount. Always prioritize clarity and precision in your merge operations. Validate your merged DataFrames, especially their shape and key column content, before proceeding with further analysis. This pragmatic approach will save you countless headaches and ensure the integrity of your data pipelines.