Kodeclik Blog
Computing average of a Python numpy array discarding NaNs
Sometimes when working with Python numpy arrays, we might end up with some elements of the array being NaN (i.e., Not a Number). For instance, assume that you are given an array of numbers and you would like to find the average of the square roots of the elements in the array. So we write a program such as:
import numpy as np
my_array = np.array([1, 4, 9, 16, 25, -36])
my_array_sqrts = np.sqrt(my_array)
print(my_array_sqrts)
print(np.mean(my_array_sqrts))
Note that this program creates a numpy array my_array containing six numbers: [1, 4, 9, 16, 25, -36] where we have conveniently included a negative number (to illustrate our idea in this blogpost). Then, it applies the square root function np.sqrt() to each element of the array, storing the results in my_array_sqrts. When calculating square roots, the program will generate a NaN (Not a Number) for the last element (-36) since the square root of a negative number is undefined in real numbers.
The output will be:
main.py:4: RuntimeWarning: invalid value encountered in sqrt
my_array_sqrts = np.sqrt(my_array)
[ 1. 2. 3. 4. 5. nan]
nan
Note that the program prints the resulting array of square roots, which shows the square roots of the positive numbers (1, 2, 3, 4, 5) followed by a NaN. When it tries to calculate the mean of all values in my_array_sqrts using np.mean(), however, the result is also NaN. This is because any arithmetic involving NaN yields NaN, so the sum that np.mean() computes, and therefore the mean itself, becomes NaN.
Remember that -36 is itself a perfectly valid number; it is the square root of -36 that produces the NaN.
So not only do we get a NaN, that NaN leads to further problems down the line.
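To see why a single NaN takes over the whole average, here is a minimal sketch. The array `values` is made up for illustration and is not part of the program above.

```python
import numpy as np

# Illustrative array with one NaN at the end (an assumption for this sketch)
values = np.array([1.0, 2.0, 3.0, np.nan])

# Any arithmetic involving NaN produces NaN, so the sum is "poisoned"
total = np.sum(values)
print(total)              # nan

# The mean is built on that sum, so it inherits the NaN
average = np.mean(values)
print(average)            # nan

# np.isnan() reports where the NaNs are; summing the mask counts them
nan_mask = np.isnan(values)
print(nan_mask)           # True only at the last position
print(nan_mask.sum())     # 1
```

Counting NaNs this way is a quick check to run before deciding how to handle them.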
What we really need is a way to conveniently discard these values when we are computing this average.
Example 1: numpy.nanmean() is numpy.mean() without NaNs
The np.nanmean() function is specifically designed to handle arrays containing NaN (Not a Number) values by excluding them from the calculation.
Consider the updated code:
import numpy as np
my_array = np.array([1, 4, 9, 16, 25, -36])
my_array_sqrts = np.sqrt(my_array)
print(my_array_sqrts)
print(np.nanmean(my_array_sqrts))
The output is:
main.py:4: RuntimeWarning: invalid value encountered in sqrt
my_array_sqrts = np.sqrt(my_array)
[ 1. 2. 3. 4. 5. nan]
3.0
Note that we still get the warning in the console and the NaN in the array, but the average computation is no longer affected! The program correctly outputs 3.0, which is the average of the five remaining numbers once the NaN is excluded.
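If the warning itself is a nuisance and you already expect some invalid inputs, one option is to silence it locally with NumPy's np.errstate context manager. This is a sketch of one possible approach rather than a recommendation to hide warnings in general:

```python
import numpy as np

my_array = np.array([1, 4, 9, 16, 25, -36])

# Temporarily ignore "invalid value" warnings inside this block only
with np.errstate(invalid="ignore"):
    my_array_sqrts = np.sqrt(my_array)

print(my_array_sqrts)               # the last element is still nan
print(np.nanmean(my_array_sqrts))   # 3.0
```

The NaN is still produced; only the console message is suppressed, and the setting reverts as soon as the with block ends.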
Example 2: Finding the average of a 1D array with explicit NaNs
Here is a second example:
import numpy as np
# Example 2: Simple 1D array
arr2 = np.array([10, np.nan, 20, 30, np.nan, 40])
result2 = np.nanmean(arr2)
print(f"1D Array average: {result2}")
This example creates an array with explicit NaN values using the “np.nan” notation, and then computes the average by ignoring the NaN values.
The output will be:
1D Array average: 25.0
Example 3: Finding row-wise averages of a 2D array
In this example we calculate row-wise averages in a 2D array using the axis=1 parameter:
import numpy as np
# Example 3: 2D array with row-wise average
arr3 = np.array([[10, 20, np.nan],
                 [40, 50, np.nan],
                 [np.nan, 6, np.nan]])
result3 = np.nanmean(arr3, axis=1)
print(f"2D Array row-wise averages: {result3}")
The output is:
2D Array row-wise averages: [15. 45. 6.]
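One edge case worth knowing about: if an entire row contains only NaNs, there is nothing left to average. In that situation np.nanmean() returns NaN for that row and emits a RuntimeWarning. The sketch below uses a made-up array `arr` and silences the warning with Python's standard warnings module, purely for illustration:

```python
import warnings
import numpy as np

# Illustrative array whose second row has no valid values at all
arr = np.array([[10, 20, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # nanmean warns when a slice has no non-NaN entries; ignore it here
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(arr, axis=1)

print(row_means)   # first entry is 15.0, second entry is nan
```

So np.nanmean() can still return NaN, but only when a slice has no usable values.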
Example 4: Finding column-wise averages of a multi-dimensional array
This example shows column-wise average calculation using axis=0, which is particularly useful when dealing with datasets that have missing values in different columns.
import numpy as np
arr4 = np.array([[24, 32, 85],
                 [57, np.nan, 16],
                 [8, 17, np.nan],
                 [43, 78, 39]])
result4 = np.nanmean(arr4, axis=0)
print(f"Column-wise averages: {result4}")
The output will be:
Column-wise averages: [33. 42.33333333 46.66666667]
In all of the above examples, the np.nanmean() function automatically adjusts the denominator in the mean calculation to account for only the non-NaN values, ensuring accurate averages.
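To make the "adjusted denominator" idea concrete, here is a rough sketch that reproduces the column-wise result from Example 4 by hand. The variable names `valid_counts`, `column_sums`, and `manual_means` are ours, chosen for illustration:

```python
import numpy as np

arr4 = np.array([[24, 32, 85],
                 [57, np.nan, 16],
                 [8, 17, np.nan],
                 [43, 78, 39]])

# Denominator: how many non-NaN values each column actually has
valid_counts = np.sum(~np.isnan(arr4), axis=0)
print(valid_counts)    # [4 3 3]

# Numerator: column sums with NaNs treated as zero
column_sums = np.nansum(arr4, axis=0)
print(column_sums)     # [132. 127. 140.]

manual_means = column_sums / valid_counts
print(np.allclose(manual_means, np.nanmean(arr4, axis=0)))   # True
```

Notice that the first column is divided by 4 while the other two are divided by 3.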
You might be wondering: how would you find the mean of these arrays if you did not have access to the nanmean() function? Here are a couple of ideas for that!
Finding the mean using boolean indexing
The first idea uses boolean indexing with ~np.isnan() to create a mask that filters out NaN values. Here is some example code:
import numpy as np
# Create sample array with NaN values
arr = np.array([10, np.nan, 20, 30, np.nan, 40])
# Method 1: Using boolean indexing
mean1 = np.mean(arr[~np.isnan(arr)])
print("Method 1 (Boolean indexing):", mean1)
This is the most straightforward approach and works well for simple calculations.
The output will be:
Method 1 (Boolean indexing): 25.0
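If the one-liner feels dense, it can help to break it into steps. The sketch below does exactly that; the intermediate names `is_nan`, `keep`, and `valid` are introduced here only for illustration:

```python
import numpy as np

arr = np.array([10, np.nan, 20, 30, np.nan, 40])

is_nan = np.isnan(arr)   # True wherever the element is NaN
keep = ~is_nan           # flip the mask: True wherever the element is valid
valid = arr[keep]        # boolean indexing keeps only the True positions

print(valid)             # [10. 20. 30. 40.]
print(valid.mean())      # 25.0
```

The filtered array has only four elements, so the mean is computed over four values rather than six.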
Finding the mean using masked arrays
The second method uses NumPy's masked array functionality, which is particularly useful in cases like this where we wish to mask (i.e., ignore) NaN values:
import numpy as np
# Create sample array with NaN values
arr = np.array([10, np.nan, 20, 30, np.nan, 40])
# Method 2: Using masked arrays
masked_arr = np.ma.masked_array(arr, np.isnan(arr))
mean2 = np.mean(masked_arr)
print("Method 2 (Masked array):", mean2)
The output will be:
Method 2 (Masked array): 25.0
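Masked arrays also understand the axis argument, so the same idea extends to 2D data. As a sketch, the snippet below reuses the array from Example 3 and builds the mask with np.ma.masked_invalid(), a NumPy shortcut that masks NaN and infinite values in one call:

```python
import numpy as np

arr3 = np.array([[10, 20, np.nan],
                 [40, 50, np.nan],
                 [np.nan, 6, np.nan]])

# Mask every NaN (and inf) entry in one step
masked = np.ma.masked_invalid(arr3)

# Row-wise mean that skips the masked entries
masked_row_means = masked.mean(axis=1)

# Convert the result back to a plain ndarray, using NaN for any fully masked row
plain_row_means = np.ma.filled(masked_row_means, np.nan)
print(plain_row_means)   # matches the row-wise averages from Example 3
```

The final conversion step is optional; it is included here so the result is a regular numpy array like the one np.nanmean() returns.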
Finding the row-wise mean in 2D arrays using boolean indexing
For 2D arrays, you can use a list comprehension with boolean indexing to calculate the mean of each row:
import numpy as np
# For 2D arrays
arr2d = np.array([[10, 20, np.nan],
                  [40, np.nan, 60],
                  [70, 80, np.nan]])
# Row-wise mean using boolean indexing
row_means = np.array([np.mean(row[~np.isnan(row)]) for row in arr2d])
print("Row-wise means:", row_means)
The output will be:
Row-wise means: [15. 50. 75.]
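The list comprehension loops over rows in Python, which can be slow for large arrays. One possible alternative, sketched below with illustrative variable names, replaces NaNs with zeros and divides each row sum by the number of valid entries:

```python
import numpy as np

arr2d = np.array([[10, 20, np.nan],
                  [40, np.nan, 60],
                  [70, 80, np.nan]])

valid = ~np.isnan(arr2d)                             # True where the value is usable
row_sums = np.where(valid, arr2d, 0.0).sum(axis=1)   # NaNs contribute 0 to the sum
row_counts = valid.sum(axis=1)                       # number of usable values per row
row_means = row_sums / row_counts

print(row_means)   # [15. 50. 75.]
```

This sketch assumes every row has at least one valid value. A row made up entirely of NaNs would cause a division by zero and produce NaN along with a warning.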
In summary, these methods will produce the same results as np.nanmean() but offer more explicit control over how missing values are handled.
If you liked this blogpost, check out our other numpy-related blogposts, such as numpy.isnan(), numpy.unique(), and numpy.sum().