Python Pandas: How to Get Categorical Columns or List of Categories
Pandas' "categorical" data type is efficient for storing columns with a limited number of unique values. Identifying which columns in your DataFrame are categorical, or extracting the unique categories present within a specific categorical column, are common tasks in data exploration and preparation.
This guide explains how to select categorical columns from a DataFrame and how to get a list of the unique categories from a specific categorical Series (column) in Pandas.
Understanding Categorical Data in Pandas
A categorical data type in Pandas is used for columns that take on a limited, and usually fixed, number of possible values (categories). Examples include gender ('Male', 'Female', 'Other'), product types, or survey responses ('Agree', 'Neutral', 'Disagree'). Using this dtype can:
- Save memory compared to storing as object (string) type.
- Improve performance for some operations (e.g., groupby).
- Enable specific statistical methods and plotting suitable for categorical data.
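The memory claim is easy to check for yourself. The following is a minimal sketch, assuming a made-up column of three department names repeated many times; the variable names are illustrative only, and the exact byte counts will vary by platform and pandas version.

```python
import pandas as pd

# Hypothetical data: 3 distinct strings repeated 10,000 times each.
values = ["Sales", "HR", "Engineering"] * 10_000

as_object = pd.Series(values, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes} bytes")
print(f"category dtype: {category_bytes} bytes")
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.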
Example DataFrame
Here, 'Department' and 'EmploymentType' are explicitly created as categorical columns.
import pandas as pd
import numpy as np # For numeric types
data = {
'EmployeeID': [101, 102, 103, 104, 105],
'Department': pd.Categorical(['Sales', 'HR', 'Engineering', 'Sales', 'HR']),
'EmploymentType': pd.Categorical(['Full-Time', 'Part-Time', 'Full-Time', 'Full-Time', 'Contractor']),
'YearsExperience': [3, 5, 2, 7, 1], # int64
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0] # float64
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print()
print("Original dtypes:")
print(df.dtypes)
print()
Output:
Original DataFrame:
EmployeeID Department EmploymentType YearsExperience Salary
0 101 Sales Full-Time 3 60000.0
1 102 HR Part-Time 5 55000.0
2 103 Engineering Full-Time 2 90000.0
3 104 Sales Full-Time 7 65000.0
4 105 HR Contractor 1 52000.0
Original dtypes:
EmployeeID int64
Department category
EmploymentType category
YearsExperience int64
Salary float64
dtype: object
Get a List of CATEGORICAL COLUMNS from a DataFrame
These methods help you identify which columns in your DataFrame have the category dtype.
Using DataFrame.select_dtypes(include='category') (Recommended)
The DataFrame.select_dtypes() method returns a subset of the DataFrame's columns based on their data types.
import pandas as pd
df_example = pd.DataFrame({
'EmployeeID': [101], 'Department': pd.Categorical(['Sales']),
'EmploymentType': pd.Categorical(['Full-Time']), 'YearsExperience': [3]
})
# ✅ Select only columns with dtype 'category'
categorical_df = df_example.select_dtypes(include=['category'])
print("DataFrame containing only categorical columns:")
print(categorical_df)
print()
# To get just the names of these columns as a list:
categorical_column_names = categorical_df.columns.tolist()
print(f"List of categorical column names: {categorical_column_names}")
Output:
DataFrame containing only categorical columns:
Department EmploymentType
0 Sales Full-Time
List of categorical column names: ['Department', 'EmploymentType']
- include=['category']: Specifies that only columns of category dtype should be kept. You can pass a list of dtypes.
- You can also include other types, e.g., 'object', if string columns might also be considered categorical in some contexts:
categorical_and_object_df = df.select_dtypes(include=['category', 'object'])
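To make the difference between the strict and the broader selection concrete, here is a small sketch. The frame and the 'Region' column are invented for illustration, and 'Region' is explicitly created with object dtype so the result does not depend on how your pandas version infers string columns.

```python
import pandas as pd

# Hypothetical frame: 'Region' is deliberately stored as object dtype.
df_mixed = pd.DataFrame({
    "Department": pd.Categorical(["Sales", "HR", "Sales"]),
    "Region": pd.Series(["North", "South", "North"], dtype=object),
    "Salary": [60000.0, 55000.0, 65000.0],
})

# Strict: only true category columns.
strict = df_mixed.select_dtypes(include=["category"]).columns.tolist()

# Broad: category columns plus object columns.
broad = df_mixed.select_dtypes(include=["category", "object"]).columns.tolist()

print(strict)
print(broad)
```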
Using select_dtypes(exclude=...)
Alternatively, you can exclude numeric or other known types to infer which ones might be categorical (or object type, which often represents categorical data before explicit conversion).
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'EmployeeID': [101], 'Department': pd.Categorical(['Sales']),
'YearsExperience': [3], 'Salary': [60000.0]
})
# Exclude numeric types to get remaining columns (which could be category, object, etc.)
non_numeric_df = df_example.select_dtypes(exclude=[np.number]) # np.number includes int and float
# Or more specific: exclude=['int64', 'float64', 'bool']
print("DataFrame excluding numeric columns:")
print(non_numeric_df)
print()
print(f"Names of non-numeric columns: {non_numeric_df.columns.tolist()}")
Output:
DataFrame excluding numeric columns:
Department
0 Sales
Names of non-numeric columns: ['Department']
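Keep in mind that excluding numeric types keeps everything else, not just categorical data. The sketch below uses an invented frame with boolean, datetime, and free-text columns to show that all of them survive the exclusion, while include=['category'] stays precise.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing several non-numeric dtypes.
df_types = pd.DataFrame({
    "Department": pd.Categorical(["Sales", "HR"]),
    "Notes": ["new hire", "on leave"],
    "IsManager": [True, False],
    "HireDate": pd.to_datetime(["2020-01-15", "2018-06-01"]),
    "Salary": [60000.0, 55000.0],
})

# Everything that is not numeric: category, text, bool, and datetime columns.
non_numeric = df_types.select_dtypes(exclude=[np.number]).columns.tolist()

# Only the columns that really have category dtype.
only_category = df_types.select_dtypes(include=["category"]).columns.tolist()

print(non_numeric)
print(only_category)
```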
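Using _get_numeric_data() and Set Difference (Less Direct)
_get_numeric_data() is a private (underscore-prefixed) DataFrame method that returns only the numeric columns. Because it is not part of the public API, its behavior may change between Pandas versions. You can find the difference between all columns and the numeric columns to get the non-numeric ones. Note that sets are unordered, so the original column order is not guaranteed to be preserved.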
import pandas as pd
df_example = pd.DataFrame({
'EmployeeID': [101], 'Department': pd.Categorical(['Sales']),
'YearsExperience': [3], 'Salary': [60000.0]
})
all_cols_set = set(df_example.columns)
numeric_cols_set = set(df_example._get_numeric_data().columns)
categorical_like_cols = list(all_cols_set - numeric_cols_set)
# Or using set.difference():
# categorical_like_cols = list(all_cols_set.difference(numeric_cols_set))
# Note: This method relies on _get_numeric_data() and might not be as robust as select_dtypes
# for specifically identifying 'category' dtype. It gets non-numeric columns.
print(f"Categorical-like columns (all_cols - numeric_cols): {categorical_like_cols}")
Output:
Categorical-like columns (all_cols - numeric_cols): ['Department']
select_dtypes is generally more explicit and recommended for selecting based on specific dtypes like category.
Get a List of the Unique CATEGORIES within a SINGLE Categorical Column
Once you have a column that is of category dtype, you can access its unique categories.
Using Series.cat.categories (Recommended)
For a Series of category dtype, the .cat accessor provides access to categorical-specific attributes and methods. .cat.categories returns an Index object containing the categories defined for the column, including any that do not currently appear in the data.
import pandas as pd
df_example = pd.DataFrame({
'Department': pd.Categorical(['Sales', 'HR', 'Engineering', 'Sales', 'HR']),
})
# Select the categorical column
department_series = df_example['Department']
# ✅ Get the unique categories
unique_departments = department_series.cat.categories
print("Unique categories in 'Department' column:")
print(unique_departments)
Output:
Unique categories in 'Department' column:
Index(['Engineering', 'HR', 'Sales'], dtype='object')
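When the categories are inferred from the data, as here, Pandas sorts them if the values are sortable (alphabetically for strings); otherwise they are kept in order of first appearance. Passing an explicit categories argument sets a custom order.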
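The effect of the categories argument on ordering can be seen in a short sketch. The variable names here are illustrative, and the department values are the same as in the example above.

```python
import pandas as pd

values = ["Sales", "HR", "Engineering", "Sales", "HR"]

# Categories inferred from the data: sorted alphabetically.
inferred = pd.Categorical(values)

# Categories given explicitly: kept in the order supplied.
explicit = pd.Categorical(values, categories=["Sales", "HR", "Engineering"])

print(inferred.categories.tolist())  # ['Engineering', 'HR', 'Sales']
print(explicit.categories.tolist())  # ['Sales', 'HR', 'Engineering']
```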
Converting cat.categories to a List
The cat.categories attribute returns an Index object. To get a Python list, use .tolist().
import pandas as pd
df_example = pd.DataFrame({
'Department': pd.Categorical(['Sales', 'HR', 'Engineering', 'Sales', 'HR']),
})
# Select the categorical column
department_series = df_example['Department']
# Get the unique categories
unique_departments = department_series.cat.categories
unique_departments_list = unique_departments.tolist()
print(f"Unique departments as a list: {unique_departments_list}")
Output:
Unique departments as a list: ['Engineering', 'HR', 'Sales']
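One subtlety worth illustrating: .cat.categories lists every category defined for the column, whether or not it occurs in the data. The sketch below assumes a made-up 'Legal' category that is declared but never used, and compares it with the values actually observed.

```python
import pandas as pd

# 'Legal' is declared as a category but never appears in the data (hypothetical).
dept = pd.Series(pd.Categorical(
    ["Sales", "HR", "Sales"],
    categories=["Engineering", "HR", "Legal", "Sales"],
))

# All declared categories, used or not.
declared = dept.cat.categories.tolist()

# Only the values that actually occur, in order of first appearance.
observed = dept.unique().tolist()

# Declared categories after dropping the unused ones.
trimmed = dept.cat.remove_unused_categories().cat.categories.tolist()

print(declared)  # ['Engineering', 'HR', 'Legal', 'Sales']
print(observed)  # ['Sales', 'HR']
print(trimmed)   # ['HR', 'Sales']
```

If you want only the categories present in the data, either unique() or remove_unused_categories() is likely the better fit, depending on whether you care about category order.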
Check if a Specific Column IS Categorical
To verify if a particular column has the category data type:
import pandas as pd
df_example = pd.DataFrame({
'Department': pd.Categorical(['Sales']), 'YearsExperience': [3]
})
# Check 'Department'
col_to_check = 'Department'
is_dept_categorical = (df_example[col_to_check].dtype.name == 'category')
print(f"Is '{col_to_check}' column categorical? {is_dept_categorical}")
# Check 'YearsExperience'
col_to_check_2 = 'YearsExperience'
is_exp_categorical = (df_example[col_to_check_2].dtype.name == 'category')
print(f"Is '{col_to_check_2}' column categorical? {is_exp_categorical}")
Output:
Is 'Department' column categorical? True
Is 'YearsExperience' column categorical? False
- df['YourColumn'].dtype: Returns the dtype object for the column.
- .name: Accesses the string name of that dtype (e.g., 'category', 'int64', 'object').
- pandas.api.types.is_categorical_dtype(series) is another way to check, but it has been deprecated since pandas 2.1 in favor of isinstance(dtype, pd.CategoricalDtype).
Alternative using pd.api.types (emits a deprecation warning on pandas 2.1 and later):
from pandas.api.types import is_categorical_dtype
print(f"Is 'Department' categorical (api)? {is_categorical_dtype(df_example['Department'])}")
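Since is_categorical_dtype is deprecated in pandas 2.1 and later, an isinstance check against pd.CategoricalDtype is a reasonable alternative. The following is a minimal sketch; the is_categorical helper and the df_check frame are hypothetical names introduced only for this example.

```python
import pandas as pd

df_check = pd.DataFrame({
    "Department": pd.Categorical(["Sales", "HR"]),
    "YearsExperience": [3, 5],
})

def is_categorical(series: pd.Series) -> bool:
    """Hypothetical helper: True if the Series has category dtype."""
    return isinstance(series.dtype, pd.CategoricalDtype)

print(is_categorical(df_check["Department"]))       # True
print(is_categorical(df_check["YearsExperience"]))  # False
```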
Conclusion
Pandas provides clear ways to work with categorical data:
- To get a list of all categorical column names in a DataFrame, the recommended method is df.select_dtypes(include=['category']).columns.tolist().
- To get a list of the unique categories within a specific categorical column (Series), use your_series.cat.categories.tolist().
- To check if a specific column is categorical, compare df['col'].dtype.name == 'category' or use isinstance(df['col'].dtype, pd.CategoricalDtype). pd.api.types.is_categorical_dtype(df['col']) also works but is deprecated since pandas 2.1.
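These recipes combine naturally. As a rough sketch, assuming a small invented frame, a dictionary comprehension can map every categorical column to its list of categories in one pass; df_all and categories_by_column are illustrative names.

```python
import pandas as pd

df_all = pd.DataFrame({
    "EmployeeID": [101, 102, 103],
    "Department": pd.Categorical(["Sales", "HR", "Sales"]),
    "EmploymentType": pd.Categorical(["Full-Time", "Part-Time", "Full-Time"]),
})

# Map each category-dtype column name to its list of categories.
categories_by_column = {
    col: df_all[col].cat.categories.tolist()
    for col in df_all.select_dtypes(include=["category"]).columns
}

print(categories_by_column)
```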
These methods allow you to effectively identify and utilize categorical data within your Pandas DataFrames for more efficient storage and specialized analysis.