Python Pandas: How to Update DataFrame Rows Conditionally During Iteration
While Pandas excels at vectorized operations for high performance, there are times when you might find yourself needing to iterate over DataFrame rows and update values based on specific conditions within each row. This could be due to complex row-dependent logic, for learning purposes, or when working with smaller DataFrames where performance differences are negligible.
This guide will thoroughly demonstrate several common methods for iterating through Pandas DataFrame rows—using iterrows()
, df.index
, and itertuples()
—and how to conditionally update cell values during this iteration using df.at[]
. We will also explore the apply()
method as a more idiomatic (though still row-wise) alternative for certain conditional updates.
Why Iterate and Update? (And a Word of Caution)
Iterating over DataFrame rows to perform updates is generally not recommended for large DataFrames due to significantly poorer performance compared to vectorized operations (e.g., using df.loc
with boolean indexing). Vectorized operations apply an action to a whole array or Series at once, leveraging underlying C implementations for speed.
However, you might choose to iterate if:
- The update logic for each row is very complex and difficult to vectorize.
- You are working with very small DataFrames where the performance overhead is acceptable.
- You are learning Pandas and want to understand row-by-row processing (though it's crucial to learn vectorized methods soon after).
This guide shows how to do it, but always consider if a vectorized alternative exists first.
Let's start with a sample DataFrame:
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
employee_name department salary bonus_eligible
0 Alice Wonderland HR 70000 True
1 Robert Tables IT 85000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 75000 True
Method 1: Using DataFrame.iterrows()
The DataFrame.iterrows()
method yields (index, Series) pairs, allowing you to access row data as a Pandas Series. For updating, it's best to use df.at[index, column_label]
for fast scalar setting.
Basic Conditional Update
Let's increase the salary by 10% for HR employees.
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
df_iterrows = df.copy()
for index, row_series in df_iterrows.iterrows():
if row_series['department'] == 'HR':
# Increase salary by 10%
df_iterrows.at[index, 'salary'] = row_series['salary'] * 1.10
print("DataFrame after iterrows() update (HR salary increase):")
print(df_iterrows)
Output:
DataFrame after iterrows() update (HR salary increase):
employee_name department salary bonus_eligible
0 Alice Wonderland HR 77000 True
1 Robert Tables IT 85000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 75000 True
df.iterrows()
: Yieldsindex
(the row label) androw_series
(the row data as a Series).df.at[index, 'salary']
: Efficiently accesses and sets a single value at the specifiedindex
andsalary
column.
Updating Based on Multiple Conditions (&
and |
)
Update salary to a fixed 90000 if department is 'IT' AND bonus_eligible
is False
.
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
df_iterrows_multi = df.copy()
for index, row_series in df_iterrows_multi.iterrows():
# Use parentheses for each condition with bitwise operators & (AND), | (OR)
if (row_series['department'] == 'IT') & (row_series['bonus_eligible'] == False):
df_iterrows_multi.at[index, 'salary'] = 90000
print("DataFrame after iterrows() (IT and not bonus_eligible salary to 90k):")
print(df_iterrows_multi)
Output:
DataFrame after iterrows() (IT and not bonus_eligible salary to 90k):
employee_name department salary bonus_eligible
0 Alice Wonderland HR 70000 True
1 Robert Tables IT 90000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 75000 True
To use OR, replace &
with |
. For example: if (condition1) | (condition2):
Method 2: Iterating with DataFrame.index
You can iterate directly over the DataFrame's index labels and use df.at[index_label, column_label]
to access and modify values.
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
df_index_iter = df.copy()
for idx_label in df_index_iter.index:
if df_index_iter.at[idx_label, 'salary'] < 80000:
df_index_iter.at[idx_label, 'salary'] = df_index_iter.at[idx_label, 'salary'] + 5000 # Give a 5k raise
print("DataFrame after iterating with df.index (salaries < 80k get 5k raise):")
print(df_index_iter)
Output:
DataFrame after iterating with df.index (salaries < 80k get 5k raise):
employee_name department salary bonus_eligible
0 Alice Wonderland HR 75000 True
1 Robert Tables IT 85000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 80000 True
This method requires you to fetch the value using df.at[]
for the condition check as well, which might be slightly less readable than iterrows()
where row_series
is directly available.
Method 3: Using DataFrame.itertuples()
(A Faster Iteration Method)
DataFrame.itertuples()
iterates over DataFrame rows as namedtuples. This is generally faster than iterrows()
because it avoids creating a Series object for each row. The first element of the tuple is the Index
, and subsequent elements are the column values.
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
df_itertuples = df.copy()
# By default, index=True, so row.Index gives the index label.
# name=None ensures default field names like _1, _2 for columns with invalid Python identifiers.
for row_tuple in df_itertuples.itertuples(index=True, name='EmployeeRow'):
# Access columns by their names (if valid identifiers) or by field name (e.g., row_tuple.salary)
# The index is available as row_tuple.Index
if row_tuple.department == 'Marketing' and row_tuple.salary < 100000:
df_itertuples.at[row_tuple.Index, 'salary'] = row_tuple.salary * 1.05 # 5% raise for Marketing
print("DataFrame after itertuples() update (Marketing salary increase):")
print(df_itertuples)
Output:
DataFrame after itertuples() update (Marketing salary increase):
employee_name department salary bonus_eligible
0 Alice Wonderland HR 70000 True
1 Robert Tables IT 85000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 78750 True
row_tuple.Index
: Accesses the index of the current row.row_tuple.column_name
: Accesses the value ofcolumn_name
for the current row.
Method 4: A More Pandas-Idiomatic Row-wise Approach: DataFrame.apply(axis=1)
While df.apply(func, axis=1)
also processes row by row, it's often used for more complex transformations where you return a new Series or modify an existing one based on row data. It's generally more idiomatic for conditional assignments to an entire column based on other columns in that row, rather than targeted cell updates within a loop.
Let's set bonus_eligible
to False
if salary is >= 100000, otherwise keep its original value.
This is an assignment to 'bonus_eligible', not an in-place update of various cells.
import pandas as pd
df = pd.DataFrame({
'employee_name': ['Alice Wonderland', 'Robert Tables', 'Charles Xavier', 'Diana Prince'],
'department': ['HR', 'IT', 'Management', 'Marketing'],
'salary': [70000, 85000, 120000, 75000],
'bonus_eligible': [True, False, True, True]
})
df_apply = df.copy()
def determine_bonus_eligibility(row_series):
if row_series['salary'] >= 100000:
return False # Not eligible
return row_series['bonus_eligible'] # Keep original eligibility
df_apply['bonus_eligible_updated'] = df_apply.apply(determine_bonus_eligibility, axis=1)
print("DataFrame after apply() to determine new bonus eligibility:")
print(df_apply)
print()
# A more common use of apply for conditional new value for a column:
# If you wanted to set salary to 0 if not bonus_eligible:
df_apply['salary_adjusted'] = df_apply.apply(
lambda row: 0 if not row['bonus_eligible'] else row['salary'],
axis=1
)
print("DataFrame with salary adjusted by apply():")
print(df_apply[['employee_name', 'bonus_eligible', 'salary', 'salary_adjusted']])
Output:
DataFrame after apply() to determine new bonus eligibility:
employee_name department salary bonus_eligible \
0 Alice Wonderland HR 70000 True
1 Robert Tables IT 85000 False
2 Charles Xavier Management 120000 True
3 Diana Prince Marketing 75000 True
bonus_eligible_updated
0 True
1 False
2 False
3 True
DataFrame with salary adjusted by apply():
employee_name bonus_eligible salary salary_adjusted
0 Alice Wonderland True 70000 70000
1 Robert Tables False 85000 0
2 Charles Xavier True 120000 120000
3 Diana Prince True 75000 75000
Using apply()
for what could be a simple df.loc
assignment is less efficient. For instance, the last example is better done with:
df.loc[df['bonus_eligible'] == False, 'salary_adjusted_loc'] = 0
df.loc[df['bonus_eligible'] == True, 'salary_adjusted_loc'] = df['salary']
Important: Performance Considerations for Large DataFrames
Explicit iteration (iterrows
, itertuples
, for idx in df.index
) is significantly slower than vectorized Pandas operations for large DataFrames.
Whenever possible, try to achieve your updates using:
- Boolean Indexing with
.loc
:# Example: Give 10% raise to HR department
df.loc[df['department'] == 'HR', 'salary'] *= 1.10 np.select()
ornp.where()
: For more complex conditional logic across columns..map()
or.replace()
: For value transformations based on mappings.
These methods operate on entire arrays/Series at once and are much more efficient.
Conclusion
While direct iteration and conditional updates in Pandas using methods like iterrows()
, itertuples()
, or looping through df.index
are possible, they should be approached with caution, especially for larger datasets due to performance implications. For these iteration methods, df.at[index_label, column_label]
is the preferred way to set values.
The DataFrame.apply(axis=1)
method provides another row-wise processing capability, often more suited for creating new columns or transforming existing ones based on row-level logic.
However, always strive to use vectorized operations (.loc
with boolean masks, np.select
, etc.) whenever your update logic allows, as they offer substantial performance benefits. Understanding both iterative and vectorized approaches will make you a more effective Pandas user.