How to Automate Excel with Python openpyxl

Marcus Thorne

6 days ago

How to Automate Excel with Python openpyxl

To automate Python with openpyxl, install the library via pip install openpyxl. You can then load an existing Excel file using openpyxl.load_workbook('your_file.xlsx') or create a new one with openpyxl.Workbook(). Access worksheets and cells directly to read, write, and modify data before saving changes with workbook.save('updated_file.xlsx').

Metric	Value / Description
Core Language	Python 3.6+
Library Version	openpyxl 3.0.x to 3.1.x (tested against 3.1.2)
Supported Formats	`.xlsx` (Office Open XML format) exclusively. Does not support legacy `.xls`.
Memory Complexity (Reading)	O(N) for loading entire workbook (N = number of cells). O(1) per row/cell when using `read_only` mode.
Memory Complexity (Writing)	O(N) for holding workbook in memory before saving. O(1) per row when using `write_only` mode.
Performance Benchmark (Rough)	~100,000 cells/second for basic read/write on modern hardware; significantly slower with complex styling/formulas.
Key Use Cases	Automated report generation, data extraction, bulk updates, CI/CD integration for data validation.

The Senior Dev Hook

In my experience, dealing with Excel files manually is one of the most soul-crushing tasks in a developer’s career. Early on, before I discovered proper automation tools, I spent countless hours generating daily reports from CSV exports, copying data, applying formulas, and formatting cells – all by hand. It was a massive waste of time that often led to human error. That’s why I’m a firm believer in automating anything that involves repetitive data manipulation, especially when it comes to spreadsheet processing. When I first integrated Python with openpyxl into our CI/CD pipelines to validate build manifests distributed via Excel, it dramatically reduced our manual QA efforts and virtually eliminated data entry mistakes. It’s a game-changer for data-driven processes.

Under the Hood Logic: How openpyxl Works

openpyxl doesn’t rely on Microsoft Excel COM objects or other platform-specific APIs, which is a significant advantage. Instead, it interacts directly with the Office Open XML format (specifically, .xlsx files). An .xlsx file is essentially a ZIP archive containing multiple XML files that define the workbook’s structure, data, styles, and other properties.

When you use openpyxl.load_workbook(), the library unpacks this ZIP archive and parses the relevant XML files into a hierarchy of Python objects: a Workbook object contains Worksheet objects, which in turn contain Cell objects. This object model allows you to programmatically navigate, read, modify, and create these components. When you call workbook.save(), openpyxl re-serializes this object hierarchy back into the XML structure and zips it up into a new .xlsx file.

This approach makes openpyxl cross-platform, efficient, and robust. It’s a pure Python solution, meaning no external dependencies beyond the standard library (aside from lxml for faster XML parsing if available, but it’s not strictly required).

Step-by-Step Implementation

1. Installation

First, you need to install the openpyxl library. I always recommend using a virtual environment for project dependencies to avoid conflicts.

python3 -m venv venv_excel_automation
source venv_excel_automation/bin/activate
pip install openpyxl

2. Creating a New Excel Workbook and Writing Data

Let’s start by creating a new workbook, adding some data to a sheet, and saving it. This is fundamental for generating new reports or structured data exports.

# Filename: create_excel.py
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment

def create_sample_excel(filename="report.xlsx"):
    """
    Creates a new Excel workbook, writes data, and applies basic formatting.
    """
    wb = Workbook() # Create a new workbook
    ws = wb.active # Get the active worksheet (by default, named 'Sheet')
    ws.title = "Sales Data" # Rename the worksheet

    # Write headers
    headers = ["Product ID", "Item Name", "Quantity", "Price", "Total"]
    ws.append(headers) # Use append to add a row of values

    # Apply bold font to headers
    for cell in ws[1]: # Iterate through cells in the first row
        cell.font = Font(bold=True)
        cell.alignment = Alignment(horizontal="center")

    # Add some data rows
    data = [
        [101, "Laptop", 5, 1200.00, "=C2*D2"], # Include a formula
        [102, "Mouse", 15, 25.50, "=C3*D3"],
        [103, "Keyboard", 10, 75.00, "=C4*D4"],
        [104, "Monitor", 3, 300.00, "=C5*D5"],
    ]
    for row_data in data:
        ws.append(row_data)

    # Adjust column widths automatically for better readability
    for column in ws.columns:
        max_length = 0
        column_letter = column[0].column_letter # Get the column letter (e.g., 'A')
        for cell in column:
            try: # Handle potential non-string types in cells
                if len(str(cell.value)) > max_length:
                    max_length = len(str(cell.value))
            except TypeError:
                pass
        adjusted_width = (max_length + 2) # Add a small buffer
        ws.column_dimensions[column_letter].width = adjusted_width

    wb.save(filename) # Save the workbook
    print(f"'{filename}' created successfully with sample data.")

if __name__ == "__main__":
    create_sample_excel()

In this code, Workbook() initializes an empty workbook. wb.active gives you the default sheet. We then use ws.append() to add rows of data, which is robust as it automatically adds to the next available row. Crucially, Excel formulas are stored as strings (e.g., "=C2*D2") and openpyxl handles them correctly.

3. Reading and Modifying an Existing Excel Workbook

Most automation tasks involve processing existing data. Here, we’ll load the file we just created, read some values, make modifications, and save it under a new name.

# Filename: process_excel.py
from openpyxl import load_workbook

def process_existing_excel(input_filename="report.xlsx", output_filename="processed_report.xlsx"):
    """
    Loads an existing Excel workbook, reads data, modifies it, and saves changes.
    """
    try:
        wb = load_workbook(input_filename) # Load the existing workbook
    except FileNotFoundError:
        print(f"Error: The file '{input_filename}' was not found.")
        return
    except Exception as e:
        print(f"An error occurred loading '{input_filename}': {e}")
        return

    ws = wb["Sales Data"] # Access the specific worksheet by name

    print(f"\n--- Reading data from '{input_filename}' ---")
    # Iterate through rows, skipping the header (row_idx > 1)
    for row_idx, row in enumerate(ws.iter_rows(min_row=2, values_only=True), start=2):
        product_id, item_name, quantity, price, total_formula = row
        # Note: values_only=True returns cell values, not cell objects.
        # If you need cell objects for modification, omit values_only.
        print(f"Row {row_idx}: ID={product_id}, Item={item_name}, Qty={quantity}, Price={price}, Total Formula='{total_formula}'")

    print("\n--- Modifying data ---")
    # Example: Update the price of "Mouse"
    for row_idx, row in enumerate(ws.iter_rows(min_row=2), start=2):
        item_name_cell = row[1] # Item Name is in the second column (index 1)
        if item_name_cell.value == "Mouse":
            new_price = 29.99
            row[3].value = new_price # Price is in the fourth column (index 3)
            # The total formula will automatically re-evaluate in Excel when opened
            print(f"Updated price for {item_name_cell.value} to {new_price}")
            break # Assuming unique item names for simplicity

    # Example: Add a new column for "Discount %" and apply a fixed discount
    ws.cell(row=1, column=6, value="Discount %").font = Font(bold=True) # Header for new column
    ws.cell(row=1, column=7, value="Net Total").font = Font(bold=True) # Header for new column

    for row_idx in range(2, ws.max_row + 1):
        # Apply a 10% discount for Laptops, 5% for others
        item_name = ws.cell(row=row_idx, column=2).value
        discount_percentage = 0.10 if item_name == "Laptop" else 0.05
        ws.cell(row=row_idx, column=6, value=discount_percentage) # Set discount percentage
        
        # Calculate Net Total using formula
        # Original Total is in column E (index 5), Discount % is in column F (index 6)
        # Assuming original total is `E{row_idx}`
        ws.cell(row=row_idx, column=7, value=f"=E{row_idx}*(1-F{row_idx})")


    wb.save(output_filename) # Save the modified workbook
    print(f"\nModified data saved to '{output_filename}'.")

if __name__ == "__main__":
    # Ensure the input file exists by running the create script first
    from create_excel import create_sample_excel
    create_sample_excel("report.xlsx") # Create the input file if it doesn't exist

    process_existing_excel("report.xlsx", "processed_report.xlsx")

Here, load_workbook() loads the file. You access specific sheets by name, like wb["Sales Data"]. When iterating rows for reading, iter_rows(values_only=True) is efficient as it directly yields cell values. For modification, you need to iterate without values_only=True to get Cell objects, which then allows you to set cell.value. Adding new columns involves setting headers and then iterating through data rows to populate them.

What Can Go Wrong (Troubleshooting)

While openpyxl is robust, certain scenarios can trip you up:

InvalidFileException / BadZipFile: This typically means you’re trying to open a file that isn’t a valid .xlsx file. Common causes include:
- Attempting to open a legacy .xls file (which openpyxl does not support).
- The .xlsx file is corrupted or incomplete.
- The file extension is misleading (e.g., a CSV file saved as .xlsx).
Fix: Ensure your input file is truly an .xlsx. For .xls files, you’d need libraries like xlrd or xlwt (though these are largely unmaintained for write operations for modern Python).
Memory Exhaustion on Large Files: Loading a very large Excel file (e.g., millions of cells) into memory using load_workbook() without specific options can consume vast amounts of RAM, potentially crashing your script.
Fix: For reading, use load_workbook(filename, read_only=True). This streams the data and significantly reduces memory usage. For writing, use Workbook(write_only=True) for similar memory benefits when generating massive reports.
Date/Time Handling: Excel stores dates as serial numbers (days since 1900-01-01 or 1904-01-01). While openpyxl usually converts these to Python datetime objects automatically, sometimes you might get raw numbers or string representations if formatting is inconsistent.
Fix: Explicitly check the cell’s data_type (e.g., cell.data_type == 'd' for date). If you receive a number, use openpyxl.utils.datetime.from_excel(excel_date_number) to convert it.
Formula Evaluation: By default, openpyxl reads formulas as strings (e.g., "=SUM(A1:A10)"). It does not evaluate them. If you need the *result* of a formula, you’d typically open the file in Excel, save it, and then re-read it, or explicitly pass data_only=True to load_workbook().
Fix: Use load_workbook(filename, data_only=True) if you only need the cached values of formulas. Be aware that this might give outdated results if the workbook hasn’t been opened and saved by Excel recently.

Performance & Best Practices

To ensure your automation scripts are efficient and scalable, especially with large datasets, consider these best practices:

Use read_only and write_only modes:
For reading large files (hundreds of thousands of rows or more), load_workbook(filename, read_only=True) is crucial. It loads worksheets in a streamed fashion, consuming significantly less memory. You then iterate over rows using ws.iter_rows().

For writing very large files, create the workbook with wb = Workbook(write_only=True). You can then append rows using ws.append(), and the data is written directly to an optimized XML structure without holding the entire workbook in memory. Note that write_only mode has limitations (e.g., no cell access by coordinate after appending).
Batch Updates: Avoid writing cell-by-cell in a loop if you’re dealing with many cells in an existing workbook. While sometimes necessary, appending rows or using ws.iter_rows() for bulk updates is generally more performant. For modifications, load the workbook once, make all necessary changes in memory, and then save once.
Efficient Iteration: When reading, use ws.iter_rows() or ws.iter_cols() with values_only=True if you only need the cell values and not the Cell objects themselves. This is faster and less memory-intensive.
Avoid Unnecessary Styling: Complex cell styles, merged cells, and images can add overhead to file size and processing time. If performance is critical, keep styling minimal.
Comparison with Alternatives:
- pandas: For complex data analysis, cleaning, and manipulation, pandas is often a better choice. It can read/write Excel files (using openpyxl or xlrd/xlwt under the hood) and provides powerful DataFrame operations. If your task involves heavy data transformation, integrate pandas: read into a DataFrame, process, then write back.
- xlwings: If you need to interact with a running Excel instance (e.g., trigger macros, work with specific Excel objects, or need live bidirectional data exchange), xlwings is the go-to. However, it requires Excel to be installed and primarily works on Windows/macOS. openpyxl is purely for file manipulation.
CI/CD Integration: For tasks like automated report generation or data validation, containerize your Python script (e.g., using Docker) and integrate it into your CI/CD pipeline. This ensures a consistent execution environment and can be triggered on schedule or specific events.

For more on this, Check out more Automation Tutorials.

Author’s Final Verdict

As a DevOps engineer, I gravitate towards tools that are efficient, reliable, and platform-agnostic. openpyxl fits this bill perfectly for Excel automation. It’s a pure Python library that allows you to programmatically control every aspect of an .xlsx file without needing Excel installed. For tasks ranging from simple data extraction to generating complex, formatted reports within a server environment or a Kubernetes pod, openpyxl is my default choice. Its read_only and write_only modes make it surprisingly capable even for very large datasets, provided you understand its memory management. While pandas might be more suitable for heavy analytical workloads, openpyxl is the precise instrument for direct, programmatic manipulation of Excel file content and structure. Master it, and you’ll automate away a significant chunk of manual data overhead.