Python NumPy: How to Interpolate NaN Values in an Array

Missing data, often represented as NaN (Not a Number) in NumPy arrays, can pose challenges for many numerical computations and analyses. Interpolation is a common technique to estimate these missing values based on the existing data points in the array. For 1D arrays, linear interpolation is a frequently used method where a missing value is estimated by fitting a straight line between its nearest known neighbors.

This guide demonstrates how to perform 1D linear interpolation of NaN values in a NumPy array using the numpy.interp() function. It also covers a convenient alternative: the Series.interpolate() method from the Pandas library.

Understanding Interpolation for Missing Values (NaNs)

Interpolation is the process of estimating unknown values that fall between known data points. When dealing with NaNs in a NumPy array, linear interpolation aims to fill these gaps by assuming a linear relationship between the valid (non-NaN) data points surrounding the NaN.

For example, given [1, NaN, 3], linear interpolation estimates the NaN as 2. Given [1, NaN, NaN, 4], the two NaNs are estimated as 2 and 3 respectively, because they sit one third and two thirds of the way along the line from 1 to 4.
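The arithmetic behind these estimates can be written out by hand. The following is a minimal sketch; the helper name `lerp` is made up for illustration and is not part of NumPy.

```python
def lerp(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# In [1, NaN, NaN, 4] the known points sit at indices 0 and 3.
filled = [lerp(i, 0, 1.0, 3, 4.0) for i in (1, 2)]
print(filled)  # [2.0, 3.0]
```

This is the same computation that np.interp() carries out for every gap, using whichever two known points surround it.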

Method 1: Using numpy.interp() for 1D Linear Interpolation

The numpy.interp(x, xp, fp) function performs one-dimensional linear interpolation. It finds the values of a function (fp) at new points (x) given a set of known data points (xp, fp), where xp must be monotonically increasing.

How numpy.interp() Works

  • x: The x-coordinates at which to evaluate the interpolated values (in our case, the indices of NaNs).
  • xp: The x-coordinates of the known data points (indices of non-NaN values). Must be increasing.
  • fp: The y-coordinates (values) of the known data points, corresponding to xp.
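As a minimal illustration of these three arguments, independent of any NaN handling, the sketch below evaluates a piecewise-linear function defined by three known points. The sample values are made up for illustration.

```python
import numpy as np

xp = np.array([0.0, 2.0, 4.0])    # x-coordinates of the known points (increasing)
fp = np.array([0.0, 10.0, 20.0])  # values at those coordinates
x = np.array([1.0, 3.0, 3.5])     # positions where estimates are wanted

y = np.interp(x, xp, fp)
print(y.tolist())  # [5.0, 15.0, 17.5]
```

For NaN filling, x becomes the indices of the missing entries, and xp and fp become the indices and values of the valid entries.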

Step-by-Step Implementation for Interpolating NaNs

To use np.interp() for NaNs in a 1D array:

  1. Identify the indices of NaN values (these are our x points).
  2. Identify the indices of non-NaN values (these are our xp points).
  3. Get the actual non-NaN values (these are our fp points).
  4. Call np.interp() and assign the results back to the NaN positions in the array.
import numpy as np

array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
print(f"Original array: {array_with_nans}")

# Make a copy to avoid modifying the original array if needed elsewhere
interpolated_array_np = array_with_nans.copy()

# 1. Find indices of NaNs (our 'x' for interpolation)
nan_indices = np.isnan(interpolated_array_np).nonzero()[0]
print(f"Indices of NaN values (x): {nan_indices}")

# 2. Find indices of non-NaNs (our 'xp')
not_nan_indices = (~np.isnan(interpolated_array_np)).nonzero()[0]
print(f"Indices of non-NaN values (xp): {not_nan_indices}")

# 3. Get the actual non-NaN values (our 'fp')
known_values = interpolated_array_np[~np.isnan(interpolated_array_np)]
# or: known_values = interpolated_array_np[not_nan_indices]
print(f"Known values (fp): {known_values}")

# 4. Perform interpolation and assign back
# Only interpolate when there are NaNs to fill and at least 2 known points to interpolate between
if len(known_values) > 1 and len(nan_indices) > 0:
    interpolated_values = np.interp(nan_indices, not_nan_indices, known_values)
    interpolated_array_np[nan_indices] = interpolated_values

print(f"Array after np.interp() interpolation: {interpolated_array_np}")

Output:

Original array: [10. 12. nan 18. nan nan 27. 30.]
Indices of NaN values (x): [2 4 5]
Indices of non-NaN values (xp): [0 1 3 6 7]
Known values (fp): [10. 12. 18. 27. 30.]
Array after np.interp() interpolation: [10. 12. 15. 18. 21. 24. 27. 30.]

A Reusable Function for np.interp()

import numpy as np

def interpolate_nans_with_numpy(array_like):
    """
    Interpolates NaN values in a 1D NumPy array using linear interpolation.
    NaNs at the beginning or end are filled with the nearest valid value
    (np.interp clamps to the first/last known value; it does not extrapolate a trend).
    """
    arr = np.array(array_like, dtype=float)  # np.array copies by default; float dtype is needed for NaN

    nan_mask = np.isnan(arr)
    if not np.any(nan_mask):  # No NaNs to interpolate
        return arr

    x_coords_all = np.arange(len(arr))

    known_x_coords = x_coords_all[~nan_mask]
    known_y_values = arr[~nan_mask]

    if len(known_y_values) < 2:  # No pair of known points to interpolate between
        # One known point: fill every NaN with it. All NaNs: return unchanged.
        if len(known_y_values) == 1:
            arr[nan_mask] = known_y_values[0]
        return arr  # Alternatively, raise an error when every value is NaN

    nan_x_coords = x_coords_all[nan_mask]

    interpolated_values = np.interp(nan_x_coords, known_x_coords, known_y_values)
    arr[nan_mask] = interpolated_values

    return arr

# Example usage:
test_array1 = np.array([1, 1, np.nan, 2, 2, np.nan, 3, 3, np.nan])  # np.NaN was removed in NumPy 2.0
print(f"Interpolated test_array1: {interpolate_nans_with_numpy(test_array1)}")

test_array2 = np.array([np.nan, 1, np.nan, 2, np.nan])
print(f"Interpolated test_array2: {interpolate_nans_with_numpy(test_array2)}")

Output:

Interpolated test_array1: [1.  1.  1.5 2.  2.  2.5 3.  3.  3. ]
Interpolated test_array2: [1.  1.  1.5 2.  2. ]
Note: np.interp() does not truly extrapolate. For points outside the range of xp, it returns the first fp value (on the left) or the last fp value (on the right) by default. NaNs at the very beginning or end of the array are therefore filled with the nearest valid data point rather than continued along a trend.
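If a constant fill at the edges is not what you want, np.interp() accepts optional left and right arguments that set the value returned for points below xp[0] or above xp[-1]. The sketch below passes np.nan for both so that leading and trailing NaNs stay missing; the variable names are illustrative.

```python
import numpy as np

arr = np.array([np.nan, 1.0, np.nan, 2.0, np.nan])
nan_mask = np.isnan(arr)
idx = np.arange(arr.size)

# Default behaviour: edge NaNs take the nearest known value.
clamped = arr.copy()
clamped[nan_mask] = np.interp(idx[nan_mask], idx[~nan_mask], arr[~nan_mask])

# left/right set to NaN: only NaNs between known points are filled.
interior_only = arr.copy()
interior_only[nan_mask] = np.interp(
    idx[nan_mask], idx[~nan_mask], arr[~nan_mask], left=np.nan, right=np.nan
)

print(clamped.tolist())        # [1.0, 1.0, 1.5, 2.0, 2.0]
print(interior_only.tolist())  # [nan, 1.0, 1.5, 2.0, nan]
```

Leaving edge values as NaN is often the safer choice when a flat fill would misrepresent the data.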

Method 2: Using pandas.Series.interpolate() (Convenient Alternative)

If you have Pandas installed, using Series.interpolate() is often more straightforward: it is designed for this task, needs no manual index bookkeeping, and exposes parameters that control how leading and trailing NaNs are treated.

Converting NumPy Array to Pandas Series

First, convert your NumPy array to a Pandas Series.

import pandas as pd # Make sure pandas is installed: pip install pandas
import numpy as np

array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])

# Convert NumPy array to Pandas Series
pd_series = pd.Series(array_with_nans)
print("Pandas Series from NumPy array:")
print(pd_series)

Output:

Pandas Series from NumPy array:
0    10.0
1    12.0
2     NaN
3    18.0
4     NaN
5     NaN
6    27.0
7    30.0
dtype: float64

Applying Series.interpolate()

The interpolate() method on a Pandas Series fills NaN values. By default, it uses linear interpolation.

import pandas as pd
import numpy as np

# pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)

# Interpolate NaN values (default method is 'linear')
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default

print("Pandas Series after .interpolate():")
print(interpolated_pd_series)

Output:

Pandas Series after .interpolate():
0    10.0
1    12.0
2    15.0
3    18.0
4    21.0
5    24.0
6    27.0
7    30.0
dtype: float64
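Note: Pandas' interpolate() offers methods beyond linear (e.g., 'polynomial' and 'spline', which require an order argument and SciPy). It also gives finer control over NaNs at the beginning and end through the limit_direction parameter ('forward', 'backward', or 'both'). With the default linear method, filling runs forward only, so trailing NaNs are filled with the last valid value while leading NaNs remain NaN.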
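The effect of these edge-handling options is easiest to see on a Series that starts and ends with NaN. The following is a minimal sketch with made-up values; limit_area is a further Series.interpolate() parameter that restricts filling to NaNs surrounded by valid values.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

default_fill = s.interpolate()                     # forward only
both_fill = s.interpolate(limit_direction="both")  # also fills the leading NaN
inside_only = s.interpolate(limit_area="inside")   # leaves edge NaNs untouched

print(default_fill.tolist())  # [nan, 1.0, 2.0, 3.0, 3.0]
print(both_fill.tolist())     # [1.0, 1.0, 2.0, 3.0, 3.0]
print(inside_only.tolist())   # [nan, 1.0, 2.0, 3.0, nan]
```

Using limit_direction="both" reproduces the edge behaviour of np.interp(), while limit_area="inside" avoids filling the edges at all.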

Converting Back to NumPy Array or List

After interpolation, you can convert the Pandas Series back to a NumPy array or a Python list.

import pandas as pd
import numpy as np

# interpolated_pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default

# Convert back to NumPy array
result_numpy_array_from_pandas = interpolated_pd_series.to_numpy() # Preferred over .values
print("Interpolated data as NumPy array (from Pandas):")
print(result_numpy_array_from_pandas)
print()

# Convert to Python list
result_list_from_pandas = interpolated_pd_series.tolist()
print("Interpolated data as Python list (from Pandas):")
print(result_list_from_pandas)

Output:

Interpolated data as NumPy array (from Pandas):
[10. 12. 15. 18. 21. 24. 27. 30.]

Interpolated data as Python list (from Pandas):
[10.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

Limitations and Considerations

  • 1D Only: Both numpy.interp() and pandas.Series.interpolate() (as shown) are primarily designed for 1D data. For 2D arrays, you would typically apply these methods row-wise or column-wise in a loop, or use more advanced 2D interpolation techniques (e.g., from scipy.interpolate).
  • Monotonic xp for np.interp(): numpy.interp() requires the xp array (indices of known points) to be monotonically increasing. This is naturally satisfied when xp are indices.
  • Extrapolation: np.interp() does not extrapolate a trend. By default it fills NaNs at the beginning or end with the nearest valid value (this can be changed with its left and right arguments). Pandas' interpolate() offers more control through its limit, limit_direction, and limit_area parameters.
  • Sufficient Known Points: Linear interpolation needs at least two known data points to interpolate between. With exactly one non-NaN value, np.interp() returns that value for every position, so all NaNs are filled with it. With no non-NaN values, np.interp() raises a ValueError because xp is empty (the interpolate_nans_with_numpy function above returns the array unchanged in that case). Pandas' interpolate() leaves NaNs in place when it cannot find points to interpolate between under its settings.

Conclusion

Interpolating NaN values is a valuable technique for handling missing data in NumPy arrays.

  • For 1D linear interpolation using pure NumPy, the numpy.interp() function provides the core mechanism. You need to manually identify the NaN and non-NaN positions and values to feed into np.interp().
  • If Pandas is available, converting your NumPy array to a pandas.Series and using its interpolate() method (e.g., pd.Series(arr).interpolate().to_numpy()) is often more convenient, offering more built-in methods and explicit control over leading and trailing NaNs. Note that with default settings, leading NaNs are left unfilled.
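As a quick sanity check, the two approaches can be compared on the example array used throughout this guide. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])

# Pure NumPy: interpolate at the NaN indices from the known indices and values.
mask = np.isnan(arr)
idx = np.arange(arr.size)
via_numpy = arr.copy()
via_numpy[mask] = np.interp(idx[mask], idx[~mask], arr[~mask])

# Pandas one-liner.
via_pandas = pd.Series(arr).interpolate().to_numpy()

print(np.allclose(via_numpy, via_pandas))  # True
```

The results agree here because every NaN lies between known values. For arrays that begin with NaN, the defaults differ: np.interp() fills the leading gap and Pandas does not.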

Choose the method that best suits your project's dependencies and the complexity of your interpolation needs. For simple 1D linear cases, both are effective.