Python Pandas: How to Fix "ValueError: Excel file format cannot be determined, you must specify an engine manually"

When attempting to read Excel files into a Pandas DataFrame using pd.read_excel(), you might encounter the ValueError: Excel file format cannot be determined, you must specify an engine manually. This error typically indicates that Pandas is unable to automatically identify the specific Excel file format (.xls, .xlsx, etc.) or that the file it's trying to read isn't a valid Excel file at all, possibly due to incorrect file extensions or interference from temporary system files.

This guide will thoroughly explain the common causes behind this ValueError, including issues with temporary lock files created by operating systems, using the wrong Pandas read function for the file type (e.g., read_excel for a CSV), and mislabeled file extensions. You'll learn how to correctly specify the Excel engine, handle temporary files, and ensure you're using the appropriate function for your file type.

Understanding the Error: Pandas and Excel Engines

Pandas relies on external libraries, called "engines," to read and write Excel files. Different Excel file formats require different engines:

.xlsx (Excel XML format, common for modern Excel): Typically requires the openpyxl engine.
.xls (older binary Excel 97-2003 format): Typically requires the xlrd engine. (Note: Recent xlrd versions have limitations with .xlsx files).
Other formats like .xlsm, .xltx, .xltm also usually use openpyxl.

The error "Excel file format cannot be determined, you must specify an engine manually" means pd.read_excel() couldn't automatically figure out which engine to use, often because the file doesn't appear to be a recognizable Excel format or it's ambiguous.

Common Cause 1: Temporary Lock Files Created by Operating Systems

How Lock Files Cause the Error

When you have an Excel file open in Microsoft Excel (or sometimes other applications), the operating system often creates a temporary, hidden "lock" file in the same directory. These files usually have names starting with ~$ (e.g., ~$MyData.xlsx). If your Python script tries to read all files with an .xlsx extension in a directory, it might attempt to read these lock files. Since lock files are not valid Excel files, pd.read_excel() fails and throws the error.

Solution: Ignoring Temporary Lock Files (e.g., `~$filename.xlsx`)

When processing multiple Excel files in a directory, explicitly skip these temporary lock files.

In the following example, the code iterates through .xlsx files, checks if the filename starts with ~$, and skips it if it does.

import pandas as pd
import glob # For finding files matching a pattern
import os   # For path operations

# Create a dummy Excel file for this example
# In a real scenario, you'd have your actual Excel files
dummy_excel_content = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
dummy_excel_content.to_excel('example_data.xlsx', index=False, engine='openpyxl')
# Simulate a lock file (usually created by Excel being open)
# For this script, we'll create it manually for demonstration
# On Windows, these might be hidden.
try:
    with open('~$example_data.xlsx', 'w') as f_lock:
        f_lock.write("lock file content") # Content doesn't matter, just existence
except PermissionError:
    print("Could not create dummy lock file due to permissions, skipping that part of demo.")


excel_files_found = []
for file_path in glob.glob('./*.xlsx'): # Get all .xlsx files in current directory
    file_name = os.path.basename(file_path)
    # ✅ Skip temporary lock files
    if file_name.startswith('~$'):
        print(f"Skipping temporary lock file: {file_name}")
        continue
    else:
        print(f"Attempting to read: {file_name}")
        try:
            # For .xlsx, specify engine='openpyxl'
            df = pd.read_excel(file_path, engine='openpyxl')
            excel_files_found.append(df)
            print(f"Successfully read: {file_name}")
            print(df.head(2))
        except ValueError as e:
            if "Excel file format cannot be determined" in str(e):
                print(f"Error reading {file_name}: {e}. Might be a lock file or not a valid Excel file.")
            else:
                print(f"Other ValueError for {file_name}: {e}")
        except Exception as e_gen:
            print(f"General error reading {file_name}: {e_gen}")


# Clean up dummy files
if os.path.exists('example_data.xlsx'): os.remove('example_data.xlsx')
if os.path.exists('~$example_data.xlsx'): os.remove('~$example_data.xlsx')

if not excel_files_found:
    print("No valid Excel files were processed successfully.")

Output:

Attempting to read: example_data.xlsx
Successfully read: example_data.xlsx
   col1  col2
0     1     3
1     2     4
Skipping temporary lock file: ~$example_data.xlsx

Common Cause 2: Using the Wrong Pandas Read Function (e.g., `read_excel` for CSV)

The Importance of Matching Function to File Type

Pandas has specific functions for different file types:

pd.read_excel(): For Excel files (.xls, .xlsx).
pd.read_csv(): For Comma Separated Values files (.csv), or other delimited text files if sep is specified.

Using pd.read_excel() on a .csv file, or pd.read_csv() on an .xlsx file, will almost certainly lead to errors, including the "Excel file format cannot be determined" error if read_excel can't make sense of the CSV structure as an Excel format.

Solution: Use `pd.read_csv()` for `.csv` and `pd.read_excel()` for `.xls`/`.xlsx`

Always use the correct Pandas function for your file type.

Programmatically Checking File Extensions

If you are processing files with unknown or mixed extensions, you can check the extension first.

import pandas as pd
import os

def read_data_file(file_path):
    file_name, file_extension = os.path.splitext(file_path)

    if file_extension.lower() == '.xlsx':
        print(f"Reading as .xlsx: {file_path}")
        return pd.read_excel(file_path, engine='openpyxl')
    elif file_extension.lower() == '.xls':
        print(f"Reading as .xls: {file_path}")
        return pd.read_excel(file_path, engine='xlrd') # xlrd needed for .xls
    elif file_extension.lower() == '.csv':
        print(f"Reading as .csv: {file_path}")
        return pd.read_csv(file_path)
    else:
        print(f"Unsupported file extension: {file_extension} for file {file_path}")
        return None

# Example usage:
# Create dummy files for testing
pd.DataFrame({'A':[1]}).to_excel('test.xlsx', index=False, engine='openpyxl')
pd.DataFrame({'B':[2]}).to_csv('test.csv', index=False)

df_excel = read_data_file('test.xlsx')
if df_excel is not None: print(df_excel)

df_csv = read_data_file('test.csv')
if df_csv is not None: print(df_csv)

# Clean up
if os.path.exists('test.xlsx'): os.remove('test.xlsx')
if os.path.exists('test.csv'): os.remove('test.csv')

Output:

Reading as .xlsx: test.xlsx
   A
0  1
Reading as .csv: test.csv
   B
0  2

Common Cause 3: Mislabeled File Extensions

Occasionally, a file might have an Excel extension (e.g., .xlsx) but actually contain data in a different format (like CSV or plain text), or it might be corrupted. In such cases, pd.read_excel() will fail to determine the format.

Solution: Verify the true file type. You might need to open it with a text editor or the supposed original application to check its actual content structure. If it's a CSV mislabeled as .xlsx, rename it to .csv and use pd.read_csv().

Specifying the Correct Engine for `pd.read_excel()`

Even if Pandas could determine the format from the extension (e.g., .xlsx), it sometimes requires you to explicitly state the engine, especially if multiple engines could potentially handle the file or if the default choice isn't working.

For `.xlsx` files: `engine='openpyxl'` (Requires `openpyxl` installation)

import pandas as pd
# Assuming 'my_modern_excel_file.xlsx' exists and is valid
df = pd.read_excel('my_modern_excel_file.xlsx', engine='openpyxl')
print(df)

For `.xls` files: `engine='xlrd'` (Requires `xlrd` installation)

import pandas as pd

# Assuming 'my_old_excel_file.xls' exists and is valid
df = pd.read_excel('my_old_excel_file.xls', engine='xlrd')
print(df)

Installing Necessary Excel Engines (`openpyxl`, `xlrd`)

If you haven't installed the required engines, Pandas cannot use them. Install them using pip:

# For .xlsx files (modern Excel)
pip install openpyxl pandas

# For .xls files (older Excel 97-2003 format)
pip install xlrd pandas

If you use Anaconda, you can use conda install openpyxl or conda install xlrd.

Conclusion

The ValueError: Excel file format cannot be determined, you must specify an engine manually typically signals one of a few issues:

Interference from Temporary Files: Your script is trying to read temporary OS lock files (like ~$filename.xlsx). Filter these out.
Incorrect Pandas Function: You're using pd.read_excel() for a non-Excel file (e.g., a CSV) or vice-versa. Use the correct pd.read_csv() or pd.read_excel().
Mislabeled File Extension: The file extension doesn't match its actual content format. Verify and correct.
Missing or Incorrect Engine: For pd.read_excel(), explicitly specify engine='openpyxl' for .xlsx files or engine='xlrd' for .xls files, and ensure these libraries are installed.

By systematically checking these points, you can resolve this error and successfully load your Excel data into Pandas DataFrames.

Understanding the Error: Pandas and Excel Engines​

Common Cause 1: Temporary Lock Files Created by Operating Systems​

How Lock Files Cause the Error​

Solution: Ignoring Temporary Lock Files (e.g., ~$filename.xlsx)​

Common Cause 2: Using the Wrong Pandas Read Function (e.g., read_excel for CSV)​

The Importance of Matching Function to File Type​

Solution: Use pd.read_csv() for .csv and pd.read_excel() for .xls/.xlsx​

Programmatically Checking File Extensions​

Common Cause 3: Mislabeled File Extensions​

Specifying the Correct Engine for pd.read_excel()​

For .xlsx files: engine='openpyxl' (Requires openpyxl installation)​

For .xls files: engine='xlrd' (Requires xlrd installation)​

Installing Necessary Excel Engines (openpyxl, xlrd)​

Conclusion​

Table of Contents