Python Pandas: How to Modify a Subset of DataFrame Rows Based on Condition
Modifying specific rows in a Pandas DataFrame based on certain conditions is a fundamental part of data manipulation and cleaning. You might need to update values in one or more columns for rows that meet specific criteria (e.g., update the 'Status' for all products in a certain 'Category', or apply a discount to items with 'Price' above a threshold).
This guide explains several effective methods to modify a subset of rows in a Pandas DataFrame, primarily using DataFrame.loc
, numpy.where()
, and DataFrame.apply()
.
The Goal: Conditional Row Modification
Given a Pandas DataFrame, we want to identify a subset of rows based on a condition (or multiple conditions) and then change the values in one or more specific columns only for those selected rows.
Example DataFrame
import pandas as pd
import numpy as np # For np.where
data = {
'ProductID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Books', 'Apparel', 'Electronics', 'Books', 'Apparel'],
'Price': [299.99, 19.95, 45.50, 799.00, 12.50, 30.00],
'Stock': [10, 50, 0, 5, 120, 0],
'Status': ['Active', 'Active', 'Discontinued', 'Active', 'Active', 'Discontinued']
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)
Output:
Original DataFrame:
ProductID Category Price Stock Status
0 101 Electronics 299.99 10 Active
1 102 Books 19.95 50 Active
2 103 Apparel 45.50 0 Discontinued
3 104 Electronics 799.00 5 Active
4 105 Books 12.50 120 Active
5 106 Apparel 30.00 0 Discontinued
Method 1: Using DataFrame.loc
(Recommended)
The DataFrame.loc
indexer is primarily label-based but also excels at boolean array indexing, making it ideal for conditional selection and assignment. The syntax is df.loc[row_condition, column_to_modify] = new_value
.
Modifying a Single Column
Let's update the 'Status' to 'Clearance' for all 'Apparel' items with 0 'Stock'.
import pandas as pd
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Books', 'Apparel', 'Electronics', 'Books', 'Apparel'],
'Price': [299.99, 19.95, 45.50, 799.00, 12.50, 30.00],
'Stock': [10, 50, 0, 5, 120, 0],
'Status': ['Active', 'Active', 'Discontinued', 'Active', 'Active', 'Discontinued']
})
# Define the condition for rows to select
condition = (df['Category'] == 'Apparel') & (df['Stock'] == 0)
print("Boolean mask for selection:")
print(condition)
print()
# ✅ Select rows meeting the condition and update the 'Status' column for them
df.loc[condition, 'Status'] = 'Clearance'
print("DataFrame after modifying 'Status' for Apparel with 0 stock:")
print(df)
Output:
Boolean mask for selection:
0 False
1 False
2 True
3 False
4 False
5 True
dtype: bool
DataFrame after modifying 'Status' for Apparel with 0 stock:
ProductID Category Price Stock Status
0 101 Electronics 299.99 10 Active
1 102 Books 19.95 50 Active
2 103 Apparel 45.50 0 Clearance
3 104 Electronics 799.00 5 Active
4 105 Books 12.50 120 Active
5 106 Apparel 30.00 0 Clearance
Modifying Based on Existing Values in the Target Column
You can use the current values of the column being modified on the right side of the assignment. For example, increase the 'Price' by 10% for 'Electronics'.
import pandas as pd
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Books', 'Apparel', 'Electronics', 'Books', 'Apparel'],
'Price': [300.0, 20.0, 45.0, 800.0, 12.0, 30.0], # Simplified prices
'Status': ['Active', 'Active', 'Clearance', 'Active', 'Active', 'Clearance']
})
condition_electronics = (df['Category'] == 'Electronics')
# ✅ Increase 'Price' by 10% for rows matching the condition
df.loc[condition_electronics, 'Price'] = df.loc[condition_electronics, 'Price'] * 1.10
print("DataFrame after increasing 'Price' for Electronics by 10%:")
print(df)
Output:
DataFrame after increasing 'Price' for Electronics by 10%:
ProductID Category Price Status
0 101 Electronics 330.0 Active
1 102 Books 20.0 Active
2 103 Apparel 45.0 Clearance
3 104 Electronics 880.0 Active
4 105 Books 12.0 Active
5 106 Apparel 30.0 Clearance
It's crucial that the selection on the right side df.loc[condition_electronics, 'Price']
aligns with the rows being modified on the left.
Modifying Multiple Columns
To modify multiple columns for the selected rows, pass a list of column names. The value assigned can be a single scalar (applied to all matched cells in the specified columns) or a list/array of values that aligns with the selection.
import pandas as pd
df = pd.DataFrame({
'ProductID': [101, 102, 103],
'Category': ['Electronics', 'Books', 'Apparel'],
'Price': [330.0, 20.0, 45.0],
'Stock': [10, 50, 0],
'Status': ['Active', 'Active', 'Clearance']
})
condition_no_stock = (df['Stock'] == 0)
columns_to_update = ['Status', 'Price']
# An option is to assign different specific values to each column
# (This requires more careful construction if values differ per row)
# For simplicity, let's assign a single new status and a new price
# For complex row-specific values, np.where or apply might be better.
# Here, we'll set 'Status' to 'Sold Out' and 'Price' to 0 for 0 stock items.
df.loc[condition_no_stock, 'Status'] = 'Sold Out'
df.loc[condition_no_stock, 'Price'] = 0.00 # Update price separately or use a more complex assignment
# If assigning the *same* value across multiple columns (less common for different types):
# df.loc[condition_no_stock, ['Status', 'Note']] = "Out of Stock"
# If assigning a list of new values (must match number of selected rows * number of columns)
# Or assign a DataFrame of new values with matching index and columns
print("DataFrame after updating multiple columns for 0 stock:")
print(df)
Output:
DataFrame after updating multiple columns for 0 stock:
ProductID Category Price Stock Status
0 101 Electronics 330.0 10 Active
1 102 Books 20.0 50 Active
2 103 Apparel 0.0 0 Sold Out
Example: Modify multiple columns based on their existing values
import pandas as pd
# Sample DataFrame
df_multi_mod = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': ['x', 'y', 'x', 'y']
})
# Condition where column 'C' is 'x'
condition_c_x = (df_multi_mod['C'] == 'x')
# Create a DataFrame from modified values
updated_values = df_multi_mod.loc[condition_c_x, ['A', 'B']].apply(
lambda r: pd.Series([r['A'] + 1, r['B'] * 2]), axis=1
)
# Assign the updated values back to the original DataFrame
df_multi_mod.loc[condition_c_x, ['A', 'B']] = updated_values.values
print("Modifying multiple columns based on existing values:")
print(df_multi_mod)
Output:
Modifying multiple columns based on existing values:
A B C
0 2 20 x
1 2 20 y
2 4 60 x
3 4 40 y
When assigning calculated values to multiple columns, using .apply()
on the right-hand side selection or preparing a DataFrame/array of new values can be necessary if the new values for each column are different and depend on existing data.
Method 2: Using numpy.where()
numpy.where(condition, value_if_true, value_if_false)
is useful for conditional assignments, especially when creating a new column or updating one based on a condition.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Books', 'Apparel', 'Electronics', 'Books', 'Apparel'],
'Price': [299.99, 19.95, 45.50, 799.00, 12.50, 30.00],
'Stock': [10, 50, 0, 5, 120, 0],
'Status': ['Active', 'Active', 'Discontinued', 'Active', 'Active', 'Discontinued']
})
# Condition: Stock == 0
condition = (df['Stock'] == 0)
# ✅ Update 'Status' column using np.where()
# If condition is True, set Status to 'Sold Out', else keep original Status
df['Status'] = np.where(condition, 'Sold Out', df['Status'])
print("DataFrame after modifying 'Status' using np.where():")
print(df)
Output:
DataFrame after modifying 'Status' using np.where():
ProductID Category Price Stock Status
0 101 Electronics 299.99 10 Active
1 102 Books 19.95 50 Active
2 103 Apparel 45.50 0 Sold Out
3 104 Electronics 799.00 5 Active
4 105 Books 12.50 120 Active
5 106 Apparel 30.00 0 Sold Out
np.where
is very efficient as it leverages NumPy's vectorized operations. It's excellent for updating a single column based on a condition applied to potentially other columns.
Method 3: Using DataFrame.apply()
(Row-wise)
For more complex conditional logic that is hard to express with direct boolean indexing or np.where
, you can use DataFrame.apply(custom_function, axis=1)
. The custom function receives each row as a Series.
import pandas as pd
df_apply_example = pd.DataFrame({
'ProductID': [101, 102, 103, 104, 105, 106],
'Category': ['Electronics', 'Books', 'Apparel', 'Electronics', 'Books', 'Apparel'],
'Price': [299.99, 19.95, 45.50, 799.00, 12.50, 30.00],
'Stock': [10, 50, 0, 5, 120, 0],
'Status': ['Active', 'Active', 'Discontinued', 'Active', 'Active', 'Discontinued']
})
def update_status_based_on_stock(row):
# This function operates on a single row (passed as a Series)
if row['Stock'] == 0 and row['Category'] == 'Apparel':
return 'Clearance_Applied' # New value for 'Status'
return row['Status'] # Keep original status otherwise
# ✅ Apply the function row-wise (axis=1) to update the 'Status' column
df_apply_example['Status'] = df_apply_example.apply(update_status_based_on_stock, axis=1)
print("DataFrame after modifying 'Status' using apply():")
print(df_apply_example)
Output:
DataFrame after modifying 'Status' using apply():
ProductID Category Price Stock Status
0 101 Electronics 299.99 10 Active
1 102 Books 19.95 50 Active
2 103 Apparel 45.50 0 Clearance_Applied
3 104 Electronics 799.00 5 Active
4 105 Books 12.50 120 Active
5 106 Apparel 30.00 0 Clearance_Applied
axis=1
: Ensures the function is applied to each row.apply()
can be less performant than vectorized methods like.loc
ornp.where
for simple conditions, but offers maximum flexibility for complex row-wise logic.
Choosing the Right Method
DataFrame.loc[condition, column(s)] = value
: Generally recommended for its clarity, conciseness, and directness for most conditional updates, especially when assigning a fixed value or values derived from the same selection.numpy.where(condition, value_if_true, value_if_false)
: Excellent for updating a single column based on a condition. It's very performant.DataFrame.apply(func, axis=1)
: Best suited for complex row-by-row conditional logic that is difficult to express with boolean indexing ornp.where
. It's typically slower than vectorized approaches.
Conclusion
Modifying a subset of rows in a Pandas DataFrame based on conditions is a core data manipulation skill.
df.loc[row_condition, column_to_change] = new_value
is the most idiomatic and versatile method for most cases.np.where()
offers an efficient alternative for updating a single column conditionally.df.apply(..., axis=1)
provides maximum flexibility for complex row-wise transformations.
Choose the method that makes your code clear, readable, and performs adequately for your dataset size and the complexity of your conditions.