Python NumPy: How to Interpolate NaN Values in an Array
Missing data, often represented as NaN (Not a Number) in NumPy arrays, can pose challenges for many numerical computations and analyses. Interpolation is a common technique to estimate these missing values based on the existing data points in the array. For 1D arrays, linear interpolation is a frequently used method where a missing value is estimated by fitting a straight line between its nearest known neighbors.
This guide demonstrates how to perform 1D linear interpolation of NaN values in a NumPy array using the numpy.interp() function. It also covers a convenient alternative: the Series.interpolate() method from the Pandas library.
Understanding Interpolation for Missing Values (NaNs)
Interpolation is the process of estimating unknown values that fall between known data points. When dealing with NaNs in a NumPy array, linear interpolation fills these gaps by assuming a linear relationship between the valid (non-NaN) data points surrounding each NaN.
For example, given [1, NaN, 3], linear interpolation estimates the NaN as 2. Given [1, NaN, NaN, 4], the two NaNs are estimated as 2 and 3, respectively.
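As a quick check of the arithmetic above, the following sketch evaluates the straight-line formula y = y0 + (x - x0) * (y1 - y0) / (x1 - x0) by hand, treating array indices as x-coordinates. The helper name linear_fill is hypothetical and exists only for this illustration.

```python
def linear_fill(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# [1, NaN, 3]: the NaN is at index 1, between index 0 (value 1) and index 2 (value 3)
print(linear_fill(1, 0, 1.0, 2, 3.0))  # 2.0

# [1, NaN, NaN, 4]: NaNs at indices 1 and 2, between index 0 (value 1) and index 3 (value 4)
print(linear_fill(1, 0, 1.0, 3, 4.0))  # 2.0
print(linear_fill(2, 0, 1.0, 3, 4.0))  # 3.0
```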
Method 1: Using numpy.interp() for 1D Linear Interpolation
The numpy.interp(x, xp, fp) function performs one-dimensional piecewise linear interpolation. Given a set of known data points (xp, fp), it returns the interpolated values at the new points x. The xp array must be monotonically increasing.
How numpy.interp() Works
- x: The x-coordinates at which to evaluate the interpolated values (in our case, the indices of NaNs).
- xp: The x-coordinates of the known data points (indices of non-NaN values). Must be increasing.
- fp: The y-coordinates (values) of the known data points, corresponding to xp.
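A minimal sketch of the three arguments in isolation, before any NaN handling is involved. The sample values are made up for illustration.

```python
import numpy as np

xp = np.array([0, 2, 4])           # known x-coordinates (increasing)
fp = np.array([0.0, 10.0, 30.0])   # known values at those coordinates
x = np.array([1, 3])               # points where estimates are wanted

result = np.interp(x, xp, fp)
print(result.tolist())  # [5.0, 20.0]
```

Each requested point falls halfway between two known points, so each result is the midpoint of the neighboring fp values.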
Step-by-Step Implementation for Interpolating NaNs
To use np.interp() for NaNs in a 1D array:
- Identify the indices of NaN values (these are our x points).
- Identify the indices of non-NaN values (these are our xp points).
- Get the actual non-NaN values (these are our fp points).
- Call np.interp() and assign the results back to the NaN positions in the array.
import numpy as np
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
print(f"Original array: {array_with_nans}")
# Make a copy to avoid modifying the original array if needed elsewhere
interpolated_array_np = array_with_nans.copy()
# 1. Find indices of NaNs (our 'x' for interpolation)
nan_indices = np.isnan(interpolated_array_np).nonzero()[0]
print(f"Indices of NaN values (x): {nan_indices}")
# 2. Find indices of non-NaNs (our 'xp')
not_nan_indices = (~np.isnan(interpolated_array_np)).nonzero()[0]
print(f"Indices of non-NaN values (xp): {not_nan_indices}")
# 3. Get the actual non-NaN values (our 'fp')
known_values = interpolated_array_np[~np.isnan(interpolated_array_np)]
# or: known_values = interpolated_array_np[not_nan_indices]
print(f"Known values (fp): {known_values}")
# 4. Perform interpolation and assign back
# Only interpolate if there are known points to interpolate from
if len(known_values) > 1 and len(nan_indices) > 0:  # Need at least 2 known points to interpolate between
    interpolated_values = np.interp(nan_indices, not_nan_indices, known_values)
    interpolated_array_np[nan_indices] = interpolated_values
print(f"Array after np.interp() interpolation: {interpolated_array_np}")
Output:
Original array: [10. 12. nan 18. nan nan 27. 30.]
Indices of NaN values (x): [2 4 5]
Indices of non-NaN values (xp): [0 1 3 6 7]
Known values (fp): [10. 12. 18. 27. 30.]
Array after np.interp() interpolation: [10. 12. 15. 18. 21. 24. 27. 30.]
A Reusable Function for np.interp()
import numpy as np
def interpolate_nans_with_numpy(array_like):
    """
    Interpolates NaN values in a 1D NumPy array using linear interpolation.
    Handles NaNs at the beginning or end by propagating the nearest valid value (extrapolation).
    """
    arr = np.array(array_like, dtype=float).copy()  # Ensure float for NaNs and copy
    nan_mask = np.isnan(arr)
    if not np.any(nan_mask):  # No NaNs to interpolate
        return arr
    x_coords_all = np.arange(len(arr))
    known_x_coords = x_coords_all[~nan_mask]
    known_y_values = arr[~nan_mask]
    if len(known_y_values) < 2:  # Cannot interpolate with less than 2 known points
        # Handle cases: all NaNs, or only one known point (fill all NaNs with it)
        if len(known_y_values) == 1:
            arr[nan_mask] = known_y_values[0]
        return arr  # Or raise error, or return original if all NaNs
    nan_x_coords = x_coords_all[nan_mask]
    interpolated_values = np.interp(nan_x_coords, known_x_coords, known_y_values)
    arr[nan_mask] = interpolated_values
    return arr
# Example usage:
test_array1 = np.array([1, 1, np.nan, 2, 2, np.nan, 3, 3, np.nan])
print(f"Interpolated test_array1: {interpolate_nans_with_numpy(test_array1)}")
test_array2 = np.array([np.nan, 1, np.nan, 2, np.nan])
print(f"Interpolated test_array2: {interpolate_nans_with_numpy(test_array2)}")
Output:
Interpolated test_array1: [1.  1.  1.5 2.  2.  2.5 3.  3.  3. ]
Interpolated test_array2: [1.  1.  1.5 2.  2. ]
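For points outside the range of xp, np.interp() does not extend the line; by default it returns the first fp value for points to the left and the last fp value for points to the right. This means NaNs at the very beginning or end of the array are filled with the nearest valid data point, unless the optional left and right arguments specify something else.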
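If filling the edges with the nearest valid value is not wanted, the left and right parameters of np.interp() set the value returned outside the known range. The sketch below compares the default behavior with left=np.nan and right=np.nan, which leaves leading and trailing NaNs in place. The variable names are illustrative only.

```python
import numpy as np

arr = np.array([np.nan, 1.0, np.nan, 2.0, np.nan])
mask = np.isnan(arr)
idx = np.arange(arr.size)

# Default: out-of-range points take the first/last known value
clamped = arr.copy()
clamped[mask] = np.interp(idx[mask], idx[~mask], arr[~mask])
print(clamped.tolist())  # [1.0, 1.0, 1.5, 2.0, 2.0]

# left/right override the value used outside the known range
kept = arr.copy()
kept[mask] = np.interp(idx[mask], idx[~mask], arr[~mask],
                       left=np.nan, right=np.nan)
print(kept.tolist())  # [nan, 1.0, 1.5, 2.0, nan]
```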
Method 2: Using pandas.Series.interpolate() (Convenient Alternative)
If you have Pandas installed, using Series.interpolate() is often more straightforward, since it is designed for this kind of task and gives explicit control over edge cases such as leading and trailing NaNs.
Converting NumPy Array to Pandas Series
First, convert your NumPy array to a Pandas Series.
import pandas as pd # Make sure pandas is installed: pip install pandas
import numpy as np
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
# Convert NumPy array to Pandas Series
pd_series = pd.Series(array_with_nans)
print("Pandas Series from NumPy array:")
print(pd_series)
Output:
Pandas Series from NumPy array:
0    10.0
1    12.0
2     NaN
3    18.0
4     NaN
5     NaN
6    27.0
7    30.0
dtype: float64
Applying Series.interpolate()
The interpolate() method on a Pandas Series fills NaN values. By default, it uses linear interpolation.
import pandas as pd
import numpy as np
# pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)
# Interpolate NaN values (default method is 'linear')
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default
print("Pandas Series after .interpolate():")
print(interpolated_pd_series)
Output:
Pandas Series after .interpolate():
0    10.0
1    12.0
2    15.0
3    18.0
4    21.0
5    24.0
6    27.0
7    30.0
dtype: float64
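Pandas' interpolate() offers various methods beyond linear (e.g., 'polynomial' and 'spline', which require an order argument and SciPy), providing more advanced options if needed. It also handles NaNs at the beginning and end more flexibly through the limit_direction parameter ('forward', 'backward', or 'both'). With the default settings, leading NaNs are left unfilled and trailing NaNs are filled with the last valid value.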
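To make the edge handling concrete, here is a small sketch comparing the default call with limit_direction='both' and limit_area='inside' on a Series that has NaNs at both ends. The sample data is made up; the outputs in the comments are what linear interpolation is expected to produce for it.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default: fills forward only, so the leading NaN stays
default_fill = s.interpolate()
print(default_fill.tolist())   # [nan, 1.0, 2.0, 3.0, 3.0]

# Fill in both directions: leading and trailing NaNs take the nearest valid value
both_fill = s.interpolate(limit_direction="both")
print(both_fill.tolist())      # [1.0, 1.0, 2.0, 3.0, 3.0]

# Only fill NaNs that are surrounded by valid values
inside_only = s.interpolate(limit_area="inside")
print(inside_only.tolist())    # [nan, 1.0, 2.0, 3.0, nan]
```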
Converting Back to NumPy Array or List
After interpolation, you can convert the Pandas Series back to a NumPy array or a Python list.
import pandas as pd
import numpy as np
# interpolated_pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default
# Convert back to NumPy array
result_numpy_array_from_pandas = interpolated_pd_series.to_numpy() # Preferred over .values
print("Interpolated data as NumPy array (from Pandas):")
print(result_numpy_array_from_pandas)
print()
# Convert to Python list
result_list_from_pandas = interpolated_pd_series.tolist()
print("Interpolated data as Python list (from Pandas):")
print(result_list_from_pandas)
Output:
Interpolated data as NumPy array (from Pandas):
[10. 12. 15. 18. 21. 24. 27. 30.]
Interpolated data as Python list (from Pandas):
[10.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
Limitations and Considerations
- 1D Only: Both numpy.interp() and pandas.Series.interpolate() (as shown) are primarily designed for 1D data. For 2D arrays, you would typically apply these methods row-wise or column-wise in a loop, or use more advanced 2D interpolation techniques (e.g., from scipy.interpolate).
- Monotonic xp for np.interp(): numpy.interp() requires the xp array (indices of known points) to be monotonically increasing. This is naturally satisfied when xp holds array indices.
- Extrapolation: np.interp() does not extrapolate linearly; NaNs at the beginning or end are filled with the nearest valid value. Pandas' interpolate() offers more control over this with its limit and limit_direction parameters.
- Sufficient Known Points: Linear interpolation needs at least two known data points to interpolate between. With a single non-NaN value, np.interp() fills every position with that value (as does the interpolate_nans_with_numpy function above). With no non-NaN values, np.interp() raises an error because xp is empty. Pandas' interpolate() may leave NaNs in place if it cannot find points to interpolate between under its settings.
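As a rough sketch of the column-wise approach mentioned in the first point, np.apply_along_axis() can pass each 1D slice of a 2D array to a small helper. The helper name fill_nans_1d is hypothetical, and it simply returns a slice unchanged when there is nothing to fill or no known value to fill from.

```python
import numpy as np

def fill_nans_1d(values):
    """Hypothetical helper: linearly interpolate NaNs in one 1D slice."""
    values = np.asarray(values, dtype=float).copy()
    mask = np.isnan(values)
    if mask.all() or not mask.any():
        return values  # no known points to use, or no NaNs to fill
    idx = np.arange(values.size)
    values[mask] = np.interp(idx[mask], idx[~mask], values[~mask])
    return values

grid = np.array([
    [1.0, 10.0],
    [np.nan, np.nan],
    [3.0, 30.0],
])

# axis=0 passes each column to the helper; use axis=1 to work along rows
filled = np.apply_along_axis(fill_nans_1d, 0, grid)
print(filled.tolist())  # [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
```

This treats every column independently, which is not the same as true 2D interpolation across both axes.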
Conclusion
Interpolating NaN values is a valuable technique for handling missing data in NumPy arrays.
- For 1D linear interpolation using pure NumPy, the numpy.interp() function provides the core mechanism. You need to manually identify the NaN and non-NaN positions and values to feed into np.interp().
- If Pandas is available, converting your NumPy array to a pandas.Series and using its interpolate() method (e.g., pd.Series(arr).interpolate().to_numpy()) is often a more convenient and robust approach, offering more built-in options and handling of edge cases like leading/trailing NaNs.
Choose the method that best suits your project's dependencies and the complexity of your interpolation needs. For simple 1D linear cases, both are effective.