Skip to main content

python-pandas-how-to-query-column-name-with-spaces

Python Pandas: How to Query Column Names with Spaces (DataFrame.query)

When working with Pandas DataFrames, you often need to filter rows based on conditions applied to column values. The DataFrame.query() method provides a convenient and expressive way to do this using a string-based query expression. However, a common challenge arises when column names themselves contain spaces.

This guide explains how to correctly query DataFrame columns that have spaces in their names using backticks, and also touches upon alternative methods like boolean indexing.

The Challenge: Spaces in Column Names

Standard Python variable naming conventions discourage spaces, but real-world datasets (e.g., from CSVs or Excel files) often have column headers containing spaces, like "First Name" or "Net Salary". When using DataFrame.query(), these spaces would normally break the query string if not handled correctly.

This exmaple would cause an error because of the space in First Name:

import pandas as pd

data = {
'First Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Gross Salary': [50000, 60000, 75000, 90000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

try:
df.query("First Name == 'Alice'") # Incorrect without backticks
except Exception as e:
print(f"Error without backticks: {e}")

Output:

Original DataFrame:
First Name Age Gross Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 75000
3 David 40 90000
Error without backticks: invalid syntax (<unknown>, line 1)

Solution: Using Backticks (`) in DataFrame.query()

Starting with Pandas version 0.25, DataFrame.query() (and DataFrame.eval()) supports the use of backticks (`) to quote column names that are not valid Python identifiers (e.g., those containing spaces, special characters, or starting with a number). The backtick character is typically found above the Tab key on most US keyboards (to the left of the 1 key).

Basic Syntax

Wrap the column name containing spaces in backticks within the query string.

`Column Name With Spaces`

Querying with String Values

import pandas as pd

data = {
'Full Name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'Alice Wonderland'],
'Department': ['HR', 'Engineering', 'HR', 'Engineering'],
'Years Of Service': [3, 5, 2, 7]
}
df = pd.DataFrame(data)

# ✅ Query using backticks for column name with spaces
query_result_string = df.query("`Full Name` == 'Alice Smith'")
print("Querying for 'Alice Smith':")
print(query_result_string)
# Output:
# Querying for 'Alice Smith':
# Full Name Department Years Of Service
# 0 Alice Smith HR 3

# Query for multiple conditions
query_alice_eng = df.query("`Full Name`.str.startswith('Alice') and Department == 'Engineering'")
print("Querying for Alice in Engineering:")
print(query_alice_eng)
# Querying for Alice in Engineering:
# Full Name Department Years Of Service
# 3 Alice Wonderland Engineering 7
  • "Full Name == 'Alice Smith'": The entire query is a string. Full Name is wrapped in backticks. 'Alice Smith' is a string literal within the query, so it's enclosed in single quotes (or double if the main query string uses single quotes).
  • You can use string methods like .str.startswith() on the backticked column name.

Querying with Numeric Values

Numeric values in the query do not need to be quoted within the query string.

import pandas as pd

data = {
'Employee ID': ['E101', 'E102', 'E103', 'E104'],
'Age Bracket': [30, 40, 30, 50],
'Annual Bonus': [5000, 7500, 6000, 5000]
}
df = pd.DataFrame(data)

# ✅ Query using backticks for column names with spaces, numeric condition
query_result_numeric = df.query("`Annual Bonus` > 5000 and `Age Bracket` == 30")
print("Querying for Bonus > 5000 and Age Bracket == 30:")
print(query_result_numeric)
# Querying for Bonus > 5000 and Age Bracket == 30:
# Employee ID Age Bracket Annual Bonus
# 2 E103 30 6000

Understanding DataFrame.query()

The DataFrame.query(expr, inplace=False, **kwargs) method evaluates a boolean expression string expr against the DataFrame's columns. It returns a new DataFrame containing rows for which the expression is true.

  • It uses numexpr as a backend by default for performance if installed, otherwise, it falls back to Python.
  • It allows referencing local variables in the query string by prefixing them with @ (e.g., df.query("Age > @min_age")).

Alternative for Older Pandas Versions (or General Use): Boolean Indexing

Before Pandas 0.25, or if you prefer a different style, boolean indexing is the standard way to filter DataFrames. This method works regardless of spaces in column names as long as you use bracket notation df['Column Name With Spaces'] to access the column.

Using DataFrame.loc

df.loc[] is primarily label-based but can also accept a boolean array/Series.

import pandas as pd

data = {
'First Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'Age': [30, 40, 50, 60],
'Net Salary': [75000, 50000, 100000, 150000]
}
df = pd.DataFrame(data)

# ✅ Boolean indexing using .loc
result_loc = df.loc[df['Net Salary'] > 60000]
print("Using .loc for Net Salary > 60000:")
print(result_loc)
# Using .loc for Net Salary > 60000:
# First Name Age Net Salary
# 0 Alice 30 75000
# 2 Charlie 50 100000
# 3 Alice 60 150000
  • df['Net Salary']: Accesses the column (a Pandas Series).
  • df['Net Salary'] > 60000: Creates a boolean Series (True where salary > 60000, False otherwise).
  • df.loc[...]: Uses this boolean Series to select rows.

Direct Boolean Indexing

You can often omit .loc for simple boolean row selection.

import pandas as pd

data = {
'First Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
'Age': [30, 40, 50, 60],
'Net Salary': [75000, 50000, 100000, 150000]
}
df = pd.DataFrame(data)

# ✅ Direct boolean indexing
result_direct = df[df['Net Salary'] > 60000]
print("Using direct boolean indexing for Net Salary > 60000:")
print(result_direct)
# Using direct boolean indexing for Net Salary > 60000:
# First Name Age Net Salary
# 0 Alice 30 75000
# 2 Charlie 50 100000
# 3 Alice 60 150000

Choosing Between query() and Boolean Indexing

  • DataFrame.query():
    • Pros: Can be more readable for complex conditions due to its string-based nature. May offer performance benefits for large DataFrames if numexpr is used. Allows referencing local variables with @.
    • Cons: Requires backticks for names with spaces. Might be slightly slower for very simple queries than direct boolean indexing. Less flexible for programmatic query construction compared to building boolean Series.
  • Boolean Indexing (df[...] or df.loc[...]):
    • Pros: Standard Python syntax for conditions. Works seamlessly with column names containing spaces (using df['Column Name']). Very flexible for building complex boolean masks using & (and), | (or), ~ (not).
    • Cons: Can become less readable with many nested conditions compared to a query string.

For column names with spaces, if you choose to use query(), backticks are essential. Boolean indexing provides a robust alternative that naturally handles such column names.

Conclusion

When using Pandas DataFrame.query() with column names that contain spaces, the solution is to wrap those column names in backticks (`) within the query string (e.g., df.query("`My Column Name` > 10")). This feature, available since Pandas 0.25, allows for expressive string-based queries even with non-standard column names.

As an alternative, boolean indexing (e.g., df[df['My Column Name'] > 10]) provides a powerful and standard Pythonic way to filter DataFrames and inherently supports column names with spaces when accessed using bracket notation. Choose the method that best suits your readability preferences and the complexity of your query.