Python Pandas: How to Change DataFrame Column Type to Categorical
In Pandas, the "categorical" data type is highly beneficial for columns that have a limited, fixed number of possible values (categories), especially if those values are strings. Using the categorical type can lead to significant memory savings and performance improvements for certain operations compared to storing such data as standard Python objects (strings).
This guide explains how to efficiently change the data type of one or more DataFrame columns to categorical using the astype() method and related techniques.
Why Use the Categorical Data Type in Pandas?
- Memory Efficiency: If a column contains repetitive string values (e.g., "Male"/"Female", country names, product categories), storing them as categoricals is much more memory-efficient. Pandas stores each unique category once and uses integer codes to represent the values.
- Performance: Operations like grouping (groupby()) or sorting on categorical columns can be faster than on object (string) columns.
- Semantic Meaning: It explicitly tells Pandas (and other users of your code) that the column represents a fixed set of categories.
- Enables Specific Operations: Some statistical or plotting functions work better or provide more relevant results with categorical data.
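The memory claim above can be checked directly. The following is a minimal sketch, assuming a made-up Series of three department names repeated many times (the data and variable names are illustrative only); exact byte counts depend on your Pandas version and platform, so none are shown here.

```python
import pandas as pd

# Hypothetical data: three department names repeated many times
departments = pd.Series(['Sales', 'HR', 'Engineering'] * 100_000)

# The same values stored as a categorical
departments_cat = departments.astype('category')

# deep=True measures the string contents themselves, not just references to them
bytes_as_strings = departments.memory_usage(deep=True)
bytes_as_category = departments_cat.memory_usage(deep=True)

print(f"As strings:  {bytes_as_strings:,} bytes")
print(f"As category: {bytes_as_category:,} bytes")
```

The saving comes from repetition: the more often each value repeats, the larger the gap. For a column where nearly every value is unique, the categorical version may not be smaller at all.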
Example DataFrame:
import pandas as pd
data = {
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
}
df = pd.DataFrame(data)
print("Original DataFrame and dtypes:")
print(df)
print()
print("Original dtypes:")
print(df.dtypes)
Output:
Original DataFrame and dtypes:
EmployeeID Department Gender YearsOfService Salary
0 101 Sales Female 3 60000.0
1 102 HR Male 5 55000.0
2 103 Engineering Male 2 90000.0
3 104 Sales Female 7 65000.0
4 105 HR Male 1 52000.0
Original dtypes:
EmployeeID int64
Department object
Gender object
YearsOfService int64
Salary float64
dtype: object
Columns like 'Department' and 'Gender' are good candidates for the categorical type.
Changing a Single Column's Type to Categorical
Using Series.astype('category')
The most common and recommended way to change a column's data type is to select the column (which returns a Series) and call the .astype() method on it.
import pandas as pd
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
})
# ✅ Change the 'Department' column to categorical
df['Department'] = df['Department'].astype('category')
print("DataFrame after changing 'Department' to category:")
print(df.head())
Output:
DataFrame after changing 'Department' to category:
EmployeeID Department Gender YearsOfService Salary
0 101 Sales Female 3 60000.0
1 102 HR Male 5 55000.0
2 103 Engineering Male 2 90000.0
3 104 Sales Female 7 65000.0
4 105 HR Male 1 52000.0
- df['Department']: Selects the 'Department' column as a Series.
- .astype('category'): Casts the Series to the categorical data type.
- df['Department'] = ...: Assigns the converted Series back to the DataFrame column.
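To see what the cast actually produced, you can inspect the result through the Series .cat accessor. This small sketch reuses the 'Department' values from the example and shows the unique labels and the integer code stored for each row.

```python
import pandas as pd

department = pd.Series(['Sales', 'HR', 'Engineering', 'Sales', 'HR']).astype('category')

# The unique labels, each stored once (sorted when inferred from the data)
print(department.cat.categories.tolist())  # ['Engineering', 'HR', 'Sales']

# One integer code per row, pointing into the categories above
print(department.cat.codes.tolist())  # [2, 1, 0, 2, 1]
```

The printed DataFrame looks unchanged after the conversion because Pandas displays the labels, not the underlying codes.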
Verifying the Change with DataFrame.dtypes
The DataFrame.dtypes attribute returns a Series showing the data type of each column.
import pandas as pd
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
})
# ✅ Change the 'Department' column to categorical
df['Department'] = df['Department'].astype('category')
# ... (after the conversion above) ...
print("Dtypes after changing 'Department':")
print(df.dtypes)
Output:
Dtypes after changing 'Department':
EmployeeID int64
Department category
Gender object
YearsOfService int64
Salary float64
dtype: object
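If you need the check inside a script rather than as printed output, one option is to test the column's dtype against pd.CategoricalDtype. This is a sketch on a trimmed-down version of the example DataFrame, not the only way to do it.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
    'YearsOfService': [3, 5, 2, 7, 1],
})
df['Department'] = df['Department'].astype('category')

# True only when the column's dtype is categorical
is_categorical = isinstance(df['Department'].dtype, pd.CategoricalDtype)
print(is_categorical)

# The same check is False for a column that was not converted
years_is_categorical = isinstance(df['YearsOfService'].dtype, pd.CategoricalDtype)
print(years_is_categorical)
```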
Changing Multiple Columns' Types to Categorical
Passing a List of Columns to astype('category')
In modern Pandas versions, you can select multiple columns and call .astype('category') on the resulting DataFrame subset.
import pandas as pd
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
})
columns_to_categorize = ['Department', 'Gender']
# ✅ Select multiple columns and apply astype
df[columns_to_categorize] = df[columns_to_categorize].astype('category')
print("Dtypes after changing 'Department' and 'Gender':")
print(df.dtypes)
Output:
Dtypes after changing 'Department' and 'Gender':
EmployeeID int64
Department category
Gender category
YearsOfService int64
Salary float64
dtype: object
You can also provide the list of columns directly:
df[['Department', 'Gender']] = df[['Department', 'Gender']].astype('category')
This is efficient and concise for multiple columns.
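A related option is to pass a dictionary that maps column names to types to DataFrame.astype(). The sketch below uses the same example columns. Unlike the subset assignment, it returns a new DataFrame, so the result is assigned back to df here.

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105],
    'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'YearsOfService': [3, 5, 2, 7, 1],
})

# Columns not named in the dictionary keep their current dtype
df = df.astype({'Department': 'category', 'Gender': 'category'})

print(df.dtypes)
```

The dictionary form is handy when different columns need different target types in a single call.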
Iterating with a for Loop (Less Common Now)
A for loop was the common approach in older Pandas versions, and it still works if you prefer explicit iteration.
import pandas as pd
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
})
columns_to_categorize = ['Department', 'Gender']
print("Changing multiple columns using a for loop:")
for col_name in columns_to_categorize:
    # Convert one column at a time
    df[col_name] = df[col_name].astype('category')
print(df.dtypes)
Output:
Changing multiple columns using a for loop:
EmployeeID int64
Department category
Gender category
YearsOfService int64
Salary float64
dtype: object
While this works, selecting the columns and calling astype('category') directly on the subset is now generally preferred for its conciseness.
Using DataFrame.apply() with a Lambda (Less Common Now)
Similarly, apply() with a lambda can be used, but for simple type casting it is also less direct than calling astype('category') on the selected columns.
import pandas as pd
df = pd.DataFrame({
'EmployeeID': [101, 102, 103, 104, 105],
'Department': ['Sales', 'HR', 'Engineering', 'Sales', 'HR'],
'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
'YearsOfService': [3, 5, 2, 7, 1],
'Salary': [60000.0, 55000.0, 90000.0, 65000.0, 52000.0]
})
columns_to_categorize = ['Department', 'Gender']
print("Changing multiple columns using apply() and lambda:")
# Apply astype to each selected column
df[columns_to_categorize] = df[columns_to_categorize].apply(lambda x: x.astype('category'))
print(df.dtypes)
Output:
Changing multiple columns using apply() and lambda:
EmployeeID int64
Department category
Gender category
YearsOfService int64
Salary float64
dtype: object
Converting All Columns (or All Except Some) to Categorical
Selecting Columns by Data Type (select_dtypes)
If you want to convert all columns of a certain type (e.g., all object type columns, which usually hold strings) or exclude certain types, select_dtypes() is very useful.
import pandas as pd
data_mixed = {
'Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'], # object
'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics'], # object
'Quantity': [10, 50, 30, 15], # int
'Price': [1200.0, 25.0, 75.0, 300.0], # float
'Status': ['In Stock', 'In Stock', 'Low Stock', 'In Stock'] # object
}
df_mixed = pd.DataFrame(data_mixed)
print("Original dtypes for mixed DataFrame:")
print(df_mixed.dtypes)
print()
# Convert all 'object' type columns (typically strings) to 'category'
object_cols = df_mixed.select_dtypes(include='object').columns
df_mixed[object_cols] = df_mixed[object_cols].astype('category')
print("Dtypes after converting all object columns to category:")
print(df_mixed.dtypes)
print()
# --- Example: Convert all columns *except* numeric types to category ---
df_mixed_reinit = pd.DataFrame(data_mixed) # Re-initialize for this example
# Select columns that are NOT int64 or float64
cols_to_convert = df_mixed_reinit.select_dtypes(exclude=['int64', 'float64']).columns
df_mixed_reinit[cols_to_convert] = df_mixed_reinit[cols_to_convert].astype('category')
print("Dtypes after converting non-numeric columns to category:")
print(df_mixed_reinit.dtypes) # Output: Same as above, as 'object' was the only non-numeric type
Output:
Original dtypes for mixed DataFrame:
Name object
Category object
Quantity int64
Price float64
Status object
dtype: object
Dtypes after converting all object columns to category:
Name category
Category category
Quantity int64
Price float64
Status category
dtype: object
Dtypes after converting non-numeric columns to category:
Name category
Category category
Quantity int64
Price float64
Status category
dtype: object
- df.select_dtypes(include=...) or df.select_dtypes(exclude=...) returns a subset of the DataFrame.
- .columns gets the column names from this subset.
- Then, use these column names to select the columns and apply astype('category').
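Converting every string column is not always a win: a column such as 'Name', where each value is unique, gains little from being categorical. The sketch below combines select_dtypes() with a uniqueness check. The helper name categorize_low_cardinality and the 0.5 threshold are hypothetical choices for illustration, not part of Pandas.

```python
import pandas as pd

def categorize_low_cardinality(frame, max_unique_ratio=0.5):
    """Return a copy with repetitive non-numeric columns cast to 'category'.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    result = frame.copy()
    for col in result.select_dtypes(exclude='number').columns:
        # Share of rows holding a distinct value (1.0 means every row is unique)
        unique_ratio = result[col].nunique() / max(len(result), 1)
        if unique_ratio <= max_unique_ratio:
            result[col] = result[col].astype('category')
    return result

df_mixed = pd.DataFrame({
    'Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics'],
    'Quantity': [10, 50, 30, 15],
    'Price': [1200.0, 25.0, 75.0, 300.0],
    'Status': ['In Stock', 'In Stock', 'Low Stock', 'In Stock'],
})

converted = categorize_low_cardinality(df_mixed)
print(converted.dtypes)
```

With this data, 'Category' and 'Status' are converted while 'Name' is left alone, because all four of its values are distinct. The right threshold depends on your data, so treat 0.5 as a starting point rather than a rule.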
Conclusion
Changing DataFrame column types to category in Pandas is a valuable optimization for memory and performance when dealing with columns that have a limited number of unique string values.
- For single columns: df['col_name'] = df['col_name'].astype('category') is standard.
- For multiple specific columns: df[list_of_cols] = df[list_of_cols].astype('category') is concise and efficient in modern Pandas.
- To convert all columns of a certain type (e.g., object) or all except certain types: Use df.select_dtypes() to get the target column names, then apply astype('category').
Always verify the conversion using df.dtypes to ensure the columns have been successfully changed to the category data type.