Python Pandas: How to Find First/Last Non-NaN Value and Index in Series/DataFrame Column
When working with Pandas DataFrames or Series that contain missing data (represented as NaN
, Not a Number, or NaT
, Not a Time), a common task is to identify the first or last valid (non-missing) entry. This is useful for data cleaning, imputation, or understanding the bounds of valid data within a sequence.
This guide explains how to find the index and the value of the first and last non-NaN entries in a Pandas Series or for each column in a DataFrame, using methods like first_valid_index()
, last_valid_index()
, and notna().idxmax()
.
The Goal: Locating First/Last Valid Data Points
Given a Pandas Series or a DataFrame, we want to:
- Find the index label of the first row that contains a non-
NaN
value. - Find the index label of the last row that contains a non-
NaN
value. - Retrieve the actual non-
NaN
values at these positions. This can be done for a single Series or applied column-wise to a DataFrame.
Example DataFrame and Series with NaN
Values
import pandas as pd
import numpy as np
# Example Series
s = pd.Series([np.nan, np.nan, 'Apple', 'Banana', np.nan, 'Cherry', np.nan],
index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print("Original Series (s):")
print(s)
print()
# Example DataFrame
data = {
'Col_A': [np.nan, 10.0, np.nan, 15.5, 20.0],
'Col_B': [np.nan, np.nan, 'X', 'Y', np.nan],
'Col_C': [np.nan, np.nan, np.nan, np.nan, np.nan], # All NaN
'Col_D': [1, 2, 3, 4, 5] # No NaN
}
df = pd.DataFrame(data, index=['R1', 'R2', 'R3', 'R4', 'R5'])
print("Original DataFrame (df):")
print(df)
Output:
Original Series (s):
a NaN
b NaN
c Apple
d Banana
e NaN
f Cherry
g NaN
dtype: object
Original DataFrame (df):
Col_A Col_B Col_C Col_D
R1 NaN NaN NaN 1
R2 10.0 NaN NaN 2
R3 NaN X NaN 3
R4 15.5 Y NaN 4
R5 20.0 NaN NaN 5
Finding First/Last Non-NaN in a Pandas SERIES
Using Series.first_valid_index()
This method returns the index label of the first non-NaN
/NaT
value in the Series. It returns None
if all values are NaN
/NaT
.
import pandas as pd
import numpy as np
s = pd.Series([np.nan, np.nan, 'Apple', 'Banana', np.nan, 'Cherry', np.nan],
index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# ✅ Find the index of the first valid (non-NaN) entry in Series s
first_valid_idx_s = s.first_valid_index()
print(f"Index of first valid value in Series s: '{first_valid_idx_s}'")
Output:
Index of first valid value in Series s: 'c'
Using Series.last_valid_index()
This method returns the index label of the last non-NaN
/NaT
value in the Series. It returns None
if all values are NaN
/NaT
.
import pandas as pd
import numpy as np
s = pd.Series([np.nan, np.nan, 'Apple', 'Banana', np.nan, 'Cherry', np.nan],
index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# ✅ Find the index of the last valid (non-NaN) entry in Series s
last_valid_idx_s = s.last_valid_index()
print(f"Index of last valid value in Series s: '{last_valid_idx_s}'")
Output:
Index of last valid value in Series s: 'f'
Getting the Values at These Indices
Once you have the index labels, use standard Series indexing (s[index_label]
) to get the values.
import pandas as pd
import numpy as np
s = pd.Series([np.nan, np.nan, 'Apple', 'Banana', np.nan, 'Cherry', np.nan],
index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
first_valid_idx_s = s.first_valid_index()
last_valid_idx_s = s.last_valid_index()
if first_valid_idx_s is not None:
first_valid_value_s = s[first_valid_idx_s]
print(f"First valid value in Series s: '{first_valid_value_s}'")
else:
print("Series s has no valid values.")
if last_valid_idx_s is not None:
last_valid_value_s = s[last_valid_idx_s]
print(f"Last valid value in Series s: '{last_valid_value_s}'")
else:
print("Series s has no valid values.")
Output:
First valid value in Series s: 'Apple'
Last valid value in Series s: 'Cherry'
Finding First/Last Non-NaN for EACH COLUMN in a DataFrame
Using DataFrame.apply()
with Series.first_valid_index()
/ last_valid_index()
(Recommended)
Apply the Series methods to each column of the DataFrame.
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'Col_A': [np.nan, 10.0, np.nan, 15.5, 20.0],
'Col_B': [np.nan, np.nan, 'X', 'Y', np.nan],
'Col_C': [np.nan, np.nan, np.nan, np.nan, np.nan],
'Col_D': [1, 2, 3, 4, 5]
}, index=['R1', 'R2', 'R3', 'R4', 'R5'])
# ✅ Get index of first valid value for each column
first_valid_indices_df = df_example.apply(pd.Series.first_valid_index)
print("Index of first valid value for each column:")
print(first_valid_indices_df)
print()
# ✅ Get index of last valid value for each column
last_valid_indices_df = df_example.apply(pd.Series.last_valid_index)
print("Index of last valid value for each column:")
print(last_valid_indices_df)
Output:
Index of first valid value for each column:
Col_A R2
Col_B R3
Col_C None
Col_D R1
dtype: object
Index of last valid value for each column:
Col_A R5
Col_B R4
Col_C None
Col_D R5
dtype: object
The result is a Series where the index is the column names and values are the first/last valid index labels for those columns.
Using DataFrame.notna().idxmax()
and a Reverse Scan for Last
df.notna()
: Returns a boolean DataFrame (True where not NaN)..idxmax()
: For each column, returns the index label of the firstTrue
value. This directly gives the index of the first non-NaN value.- To find the last non-NaN, you can reverse the boolean DataFrame (
[::-1]
), find theidxmax()
of the reversed frame, which gives the reversed index of the lastTrue
. This is slightly more convoluted thanlast_valid_index()
.
import pandas as pd
import numpy as np
df_example = pd.DataFrame({
'Col_A': [np.nan, 10.0, np.nan, 15.5, 20.0],
'Col_B': [np.nan, np.nan, 'X', 'Y', np.nan],
'Col_C': [np.nan, np.nan, np.nan, np.nan, np.nan],
'Col_D': [1, 2, 3, 4, 5]
}, index=['R1', 'R2', 'R3', 'R4', 'R5'])
# ✅ Index of first non-NaN using notna() and idxmax()
# idxmax will raise ValueError if a column is all NaN
first_valid_idxmax = {}
for col in df_example.columns:
if df_example[col].notna().any(): # Check if there's any non-NaN value
first_valid_idxmax[col] = df_example[col].notna().idxmax()
else:
first_valid_idxmax[col] = None
first_valid_idxmax_series = pd.Series(first_valid_idxmax)
print("Index of first valid value (notna().idxmax()):")
print(first_valid_idxmax_series)
print()
# Index of last non-NaN using a reversed scan with notna() and idxmax()
# This is less direct than .last_valid_index()
last_valid_idxmax_reversed = {}
for col in df_example.columns:
notna_series = df_example[col].notna()
if notna_series.any():
# Get boolean series, reverse it, find first True in reversed,
# then map that reversed position back to original index.
# idxmax() on a boolean Series returns the index of the first True.
# Reversing the boolean series and then finding the first True
# is equivalent to finding the last True in the original order.
last_valid_idxmax_reversed[col] = notna_series[::-1].idxmax()
else:
last_valid_idxmax_reversed[col] = None
last_valid_idxmax_reversed_series = pd.Series(last_valid_idxmax_reversed)
print("Index of last valid value (reversed notna().idxmax()):")
print(last_valid_idxmax_reversed_series)
Output:
Index of first valid value (notna().idxmax()):
Col_A R2
Col_B R3
Col_C None
Col_D R1
dtype: object
Index of last valid value (reversed notna().idxmax()):
Col_A R5
Col_B R4
Col_C None
Col_D R5
dtype: object
idxmax()
applied to an all-False
boolean Series (which happens if a column is all NaN
) will raise a ValueError
. The if df[col].notna().any():
check handles this. df.apply(pd.Series.first_valid_index)
is generally cleaner.
Getting the Values at These Indices per Column
Once you have the Series of first/last valid indices for each column, you can retrieve the actual values.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Col_A': [np.nan, 10.0, np.nan, 15.5, 20.0],
'Col_B': [np.nan, np.nan, 'X', 'Y', np.nan],
'Col_C': [np.nan, np.nan, np.nan, np.nan, np.nan],
'Col_D': [1, 2, 3, 4, 5]
}, index=['R1', 'R2', 'R3', 'R4', 'R5'])
first_valid_indices_df = df.apply(pd.Series.first_valid_index)
last_valid_indices_df = df.apply(pd.Series.last_valid_index)
first_values = pd.Series(index=df.columns, dtype=object)
for col_name, idx_label in first_valid_indices_df.items():
if pd.notna(idx_label): # Check if an index was found (not None)
first_values[col_name] = df.loc[idx_label, col_name]
else:
first_values[col_name] = np.nan # Or None
print("First valid values for each column:")
print(first_values)
print()
last_values = pd.Series(index=df.columns, dtype=object)
for col_name, idx_label in last_valid_indices_df.items():
if pd.notna(idx_label):
last_values[col_name] = df.loc[idx_label, col_name]
else:
last_values[col_name] = np.nan
print("Last valid values for each column:")
print(last_values)
Output:
First valid values for each column:
Col_A 10.0
Col_B X
Col_C NaN
Col_D 1
dtype: object
Last valid values for each column:
Col_A 20.0
Col_B Y
Col_C NaN
Col_D 5
dtype: object
Handling Columns with All NaN
Values
Series.first_valid_index()
andSeries.last_valid_index()
will returnNone
if a Series (column) contains onlyNaN
values.DataFrame.notna().idxmax()
will raise aValueError
if a column is allNaN
(because there's noTrue
value to find the index of). You need to handle this with a preliminary check likeif df[col].notna().any():
.
The df.apply(pd.Series.first_valid_index)
approach gracefully returns None
for all-NaN columns, which is often easier to work with.
Conclusion
Pandas provides convenient methods for finding the first and last non-NaN
(valid) entries:
- For a single Series:
your_series.first_valid_index()
gives the index label of the first valid value.your_series.last_valid_index()
gives the index label of the last valid value.- Both return
None
if the Series is allNaN
or empty.
- For each column in a DataFrame:
df.apply(pd.Series.first_valid_index)
anddf.apply(pd.Series.last_valid_index)
are the recommended ways to get a Series of these indices (column names as index, first/last valid index labels as values).df.notna().idxmax()
can find the first valid index per column but requires care for all-NaN columns. A similar reversed approach can find the last.
- Once you have the index label(s), use
df.loc[index_label, column_name]
orseries[index_label]
to get the actual value(s).
These methods are essential for data cleaning, identifying data boundaries, and preprocessing steps in time series or sequential data analysis.