Skip to main content

Python Pandas: Read Specific Columns from Excel File (read_excel, usecols)

When working with Excel files in Pandas using pd.read_excel(), you often don't need to load all columns from the spreadsheet, especially if the file is large or contains many irrelevant fields. Pandas provides the versatile usecols parameter to specify exactly which columns you want to read, allowing you to select columns by their letter identifiers (A, B, C...), integer positions (0, 1, 2...), or their actual names from the header row.

This guide explains how to effectively use the usecols parameter to read specific columns from an Excel file into a Pandas DataFrame.

Why Read Specific Columns?

  • Memory Efficiency: Loading only necessary columns reduces the memory footprint of your DataFrame, crucial for large Excel files.
  • Performance: Reading fewer columns can speed up the data loading process.
  • Focus: Simplifies the DataFrame by including only relevant data for your analysis.
  • Data Cleaning: Avoids loading columns that might have problematic data types or require extensive cleaning if they are not needed.

Prerequisites: Installing openpyxl (or other engines)

To read .xlsx files (the modern Excel format), Pandas typically uses the openpyxl library as its engine. If you need to read older .xls files, xlrd is often used. Ensure you have the necessary engine installed:

pip install pandas openpyxl
# For .xls support, you might also need xlrd:
pip install xlrd
note

Remember to use pip3 or python -m pip if that's standard for your environment.

Example Excel File (data.xlsx)

Assume we have an Excel file named data.xlsx with the following content:

first_name (A)last_name (B)date (C)age (D)experience (E)salary (F)
AliceSmith2023-01-1529255000
TomNolan2023-03-2030362000
CarlJones2022-07-1031468000
DianaAdams2023-05-0132575000

(Column letters A-F are shown for reference).

Selecting Columns by Excel Letters or Integer Positions

The usecols parameter in pd.read_excel(filepath_or_buffer, usecols=...) can accept various formats for specifying columns.

Using a Comma-Separated String of Letters (e.g., "A,C,E")

You can provide a string of Excel column letters, separated by commas.

import pandas as pd

file_path = 'data.xlsx' # Path to your Excel file

# ✅ Select columns A (first_name), C (date), and E (experience)
df_letters = pd.read_excel(file_path, usecols="A,C,E")

print("DataFrame with columns A, C, E (by letter):")
print(df_letters)

Output (column names are taken from the header row of the Excel file):

  first_name        date  experience
0 Alice 2023-01-15 2
1 Tom 2023-03-20 3
2 Carl 2022-07-10 4
3 Diana 2023-05-01 5

Using a List of Integer Positions (e.g., [0, 2, 4])

Provide a list of 0-based integer column indices.

import pandas as pd
file_path = 'data.xlsx'

# ✅ Select columns at position 0 (first_name), 2 (date), and 4 (experience)
df_positions = pd.read_excel(file_path, usecols=[0, 2, 4])

print("DataFrame with columns at positions 0, 2, 4:")
print(df_positions)

Output:

  first_name        date  experience
0 Alice 2023-01-15 2
1 Tom 2023-03-20 3
2 Carl 2022-07-10 4
3 Diana 2023-05-01 5

Using Excel-Style Ranges (e.g., "A:C", "A,D:F")

You can specify ranges of columns using Excel's colon notation (e.g., "A:C" for columns A through C, inclusive). You can also combine ranges and individual columns.

import pandas as pd
file_path = 'data.xlsx'

# ✅ Select columns A through C (inclusive)
df_range_AC = pd.read_excel(file_path, usecols="A:C")
print("\nDataFrame with columns A through C (A:C):")
print(df_range_AC)
# Output:
# first_name last_name date
# 0 Alice Smith 2023-01-15
# ...

# ✅ Select columns A through B, AND column E
df_range_ABE = pd.read_excel(file_path, usecols="A:B,E")
print("\nDataFrame with columns A:B and E:")
print(df_range_ABE)
# Output:
# first_name last_name experience
# 0 Alice Smith 2
# ...

# ✅ Select multiple ranges: A through B, AND D through E
df_multi_range = pd.read_excel(file_path, usecols="A:B,D:E")
print("\nDataFrame with columns A:B and D:E:")
print(df_multi_range)
# Output:
# first_name last_name age experience
# 0 Alice Smith 29 2
# ...
note

Ranges like "A:C" are inclusive of both start and end columns.

Using Python's range() for Positional Slices

You can pass a Python range() object to usecols to select a sequence of columns by position. Remember that range(start, stop) is exclusive of stop.

import pandas as pd
file_path = 'data.xlsx'

# ✅ Select columns at position 0 up to (but not including) position 3
# This means columns 0, 1, 2 (A, B, C)
df_py_range = pd.read_excel(file_path, usecols=range(0, 3))
# Or simply range(3) if starting from 0

print("DataFrame with columns using range(0, 3):")
print(df_py_range)

Output:

DataFrame with columns using range(0, 3):
first_name last_name date
0 Alice Smith 2023-01-15
...

Selecting Columns by Their Names (Header Row)

If your Excel file has a header row, you can pass a list of column names (strings) that you want to read. This is often the most robust method as it doesn't depend on column letters or positions, which might change.

import pandas as pd
file_path = 'data.xlsx'

# ✅ Select columns by their names from the header row
columns_to_read_by_name = ['first_name', 'age', 'salary']
df_names = pd.read_excel(file_path, usecols=columns_to_read_by_name)

print("DataFrame with specific columns selected by name:")
print(df_names)

Output:

DataFrame with specific columns selected by name:
first_name age salary
0 Alice 29 55000
1 Tom 30 62000
2 Carl 31 68000
3 Diana 32 75000
  • The names in the usecols list must exactly match the column names in the Excel file's header row (case-sensitive).
  • The header row itself is determined by the header parameter of read_excel (defaults to 0, the first row).

Important Notes on usecols Behavior

  • Default: If usecols is None (the default), all columns are parsed.
  • Callable: usecols can also be a function that takes a column name string and returns True if the column should be parsed, False otherwise. This allows for more dynamic column selection logic.
  • Error Handling: If you specify column names in usecols that do not exist in the Excel file, Pandas might raise a ValueError (e.g., "Usecols do not match columns"). Ensure your specified names/letters/indices are correct.

Conclusion

The usecols parameter in pandas.read_excel() provides powerful flexibility for reading only specific columns from an Excel file, which is essential for efficient data loading and focused analysis. You can select columns:

  • By Excel letter strings: "A,C", "B:D,F"
  • By 0-based integer positions: [0, 2], range(3)
  • By column names from the header row: ['Column Name 1', 'Another Column'] (Recommended for robustness if headers exist)

Choose the method that best suits the structure of your Excel file and your selection criteria. Using column names is generally the most reliable if your Excel file has a consistent header row.