Python NumPy: How to Get N Random Rows from a 2D Array
Selecting a random subset of rows from a 2D NumPy array is a common task in data sampling, bootstrapping, or when preparing data for machine learning (e.g., creating random mini-batches). NumPy, with its powerful random number generation capabilities, provides several efficient ways to achieve this.
This guide will comprehensively demonstrate methods for extracting N random rows from a NumPy array, covering scenarios both with replacement (where rows can be selected multiple times) and without replacement (ensuring each selected row is unique). We'll explore techniques using numpy.random.randint, numpy.random.choice, and numpy.random.shuffle.
The Goal: Randomly Subsetting Array Rows
Given a 2D NumPy array, our objective is to select a specified number (N) of its rows at random. The key considerations are:
- With Replacement: Can the same row be picked multiple times in our sample of N rows?
- Without Replacement: Must every selected row be unique (i.e., once a row is picked, it cannot be picked again)? This is only possible when N is less than or equal to the total number of rows.
Let's define a sample 2D NumPy array:
import numpy as np
data_array = np.array([
    [10, 20, 30],  # Row 0
    [40, 50, 60],  # Row 1
    [70, 80, 90],  # Row 2
    [11, 22, 33],  # Row 3
    [44, 55, 66]   # Row 4
])
print("Original 2D NumPy Array:")
print(data_array)
Output:
Original 2D NumPy Array:
[[10 20 30]
 [40 50 60]
 [70 80 90]
 [11 22 33]
 [44 55 66]]
Method 1: Random Row Selection with Replacement using numpy.random.randint()
This method involves generating N random row indices (which can have duplicates) and then using these indices to select rows from the original array.
import numpy as np
num_rows_to_select = 3
total_rows_in_array = data_array.shape[0] # Gets the number of rows (5 in this case)
# Generate 'num_rows_to_select' random integers between 0 (inclusive)
# and 'total_rows_in_array' (exclusive). Replacement is implicit.
random_indices_with_replacement = np.random.randint(
    low=0,
    high=total_rows_in_array,
    size=num_rows_to_select
)
print(f"Random indices (with replacement): {random_indices_with_replacement}\n")
# Use these indices to select rows. The ':' selects all columns for these rows.
selected_rows_with_replacement = data_array[random_indices_with_replacement, :]
# Or simply: selected_rows_with_replacement = data_array[random_indices_with_replacement]
print("N random rows (with replacement):")
print(selected_rows_with_replacement)
Output:
Random indices (with replacement): [0 2 1]
N random rows (with replacement):
[[10 20 30]
 [70 80 90]
 [40 50 60]]
Because np.random.randint generates integers independently, the same index can appear multiple times, leading to row repetition in the output. The sample run above happens to contain no repeats; results vary from run to run.
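To make the repetition visible, the following minimal sketch requests more rows than the array contains, so at least one index must repeat. The variable names are illustrative and not part of the original example.

```python
import numpy as np

data_array = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
    [11, 22, 33],
    [44, 55, 66],
])

# Draw 8 indices from only 5 possible values: repeats are unavoidable.
indices = np.random.randint(0, data_array.shape[0], size=8)
sample = data_array[indices]

# np.unique collapses repeated indices, so it can never exceed 5 entries here.
unique_indices = np.unique(indices)

print("Indices drawn:   ", indices)
print("Distinct indices:", unique_indices)
print("Sample shape:    ", sample.shape)
```

Requesting more rows than exist only works when sampling with replacement, which is why this approach suits bootstrapping-style resampling.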
Method 2: Random Row Selection Without Replacement using numpy.random.choice() (Recommended)
For selecting unique random rows (without replacement), numpy.random.choice() is the ideal tool. It allows you to draw a random sample from a given 1D array (in this case, an array of row indices) and specify whether replacement is allowed.
import numpy as np
num_rows_to_select_unique = 3
total_rows_in_array = data_array.shape[0]
# Generate 'num_rows_to_select_unique' random indices from the range [0, total_rows_in_array - 1]
# replace=False ensures each selected index (and thus row) is unique.
random_unique_indices = np.random.choice(
    a=total_rows_in_array,  # Generate indices from arange(0, total_rows_in_array)
    size=num_rows_to_select_unique,
    replace=False  # Crucial for no replacement
)
print(f"Random unique indices (without replacement): {random_unique_indices}\n")
# Select rows using these unique indices
selected_rows_without_replacement = data_array[random_unique_indices, :]
# Or: selected_rows_without_replacement = data_array[random_unique_indices]
print("N random rows (without replacement):")
print(selected_rows_without_replacement)
Output:
Random unique indices (without replacement): [2 4 3]
N random rows (without replacement):
[[70 80 90]
 [44 55 66]
 [11 22 33]]
When replace=False, numpy.random.choice ensures that all generated indices are distinct. In that case the size parameter must be less than or equal to a (or the length of a if a is an array); otherwise a ValueError is raised.
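The size constraint can be checked directly. This short sketch, with illustrative variable names, asks for 6 unique indices out of 5 and captures the resulting error, then repeats the request with replacement allowed.

```python
import numpy as np

total_rows = 5

# Asking for 6 distinct indices out of 5 cannot be satisfied.
try:
    np.random.choice(total_rows, size=6, replace=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("ValueError raised:", error_message)

# The same request is fine once replacement is allowed.
oversized = np.random.choice(total_rows, size=6, replace=True)
print("With replacement:", oversized)
```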
Method 3: Shuffling the Array and Taking the Top N Rows using numpy.random.shuffle()
Another way to get N unique random rows is to shuffle a copy of the array (or the array itself, if in-place modification is acceptable) and then take the first N rows. numpy.random.shuffle() shuffles the array along its first axis (rows) in-place.
import numpy as np
num_rows_to_select_shuffle = 2
# Important: np.random.shuffle modifies the array in-place.
# If you don't want to change the original data_array, work on a copy.
array_copy_for_shuffle = data_array.copy()
np.random.shuffle(array_copy_for_shuffle) # Shuffles rows of array_copy_for_shuffle in-place
print("Array after shuffling (array_copy_for_shuffle):")
print(array_copy_for_shuffle)
# Select the first 'num_rows_to_select_shuffle' rows from the shuffled array
selected_rows_shuffled = array_copy_for_shuffle[:num_rows_to_select_shuffle, :]
# Or: selected_rows_shuffled = array_copy_for_shuffle[:num_rows_to_select_shuffle]
print(f"{num_rows_to_select_shuffle} random rows after shuffling:")
print(selected_rows_shuffled)
Output:
Array after shuffling (array_copy_for_shuffle):
[[40 50 60]
 [10 20 30]
 [44 55 66]
 [11 22 33]
 [70 80 90]]
2 random rows after shuffling:
[[40 50 60]
 [10 20 30]]
This method naturally provides selection without replacement. Note that np.random.shuffle() returns None and modifies the array directly.
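If the in-place behavior is inconvenient, one possible variation (not covered by the method above) is numpy.random.permutation(), which returns a shuffled copy when given an array, or a shuffled range of indices when given an integer. The sketch below shows both forms; the variable names are illustrative.

```python
import numpy as np

data_array = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
    [11, 22, 33],
    [44, 55, 66],
])
original_snapshot = data_array.copy()
num_rows = 2

# Form 1: permutation() returns a shuffled copy; data_array is untouched.
selected_rows = np.random.permutation(data_array)[:num_rows]

# Form 2: shuffle only the row indices, then index the original array.
row_indices = np.random.permutation(data_array.shape[0])[:num_rows]
selected_by_index = data_array[row_indices]

print(selected_rows)
print(selected_by_index)
print("Original unchanged:", np.array_equal(data_array, original_snapshot))
```

The second form avoids copying the whole array, which may matter when the array is large and only a few rows are needed.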
Creating a Reusable Function for Random Row Selection
If you frequently need to select random rows, with or without replacement, a helper function can be convenient.
import numpy as np
def get_n_random_rows(input_array, num_rows, replace_sampling=False):
    """
    Selects N random rows from a 2D NumPy array.
    Args:
        input_array (np.ndarray): The 2D input array.
        num_rows (int): The number of random rows to select.
        replace_sampling (bool): If True, sample with replacement.
            If False, sample without replacement.
    Returns:
        np.ndarray: A new array containing N randomly selected rows.
    """
    if not isinstance(input_array, np.ndarray) or input_array.ndim != 2:
        raise ValueError("Input must be a 2D NumPy array.")
    if not isinstance(num_rows, int) or num_rows <= 0:
        raise ValueError("Number of rows to select must be a positive integer.")
    if not replace_sampling and num_rows > input_array.shape[0]:
        raise ValueError("Cannot select more rows than available when replace_sampling=False.")
    total_available_rows = input_array.shape[0]
    if replace_sampling:
        random_indices = np.random.randint(0, total_available_rows, size=num_rows)
    else:
        random_indices = np.random.choice(total_available_rows, size=num_rows, replace=False)
    return input_array[random_indices]  # Integer-array indexing returns a copy of the selected rows
# --- Example usage of the function ---
data_array = np.array([
    [10, 20, 30],  # Row 0
    [40, 50, 60],  # Row 1
    [70, 80, 90],  # Row 2
    [11, 22, 33],  # Row 3
    [44, 55, 66]   # Row 4
])
print("Using reusable function (without replacement):")
sample1 = get_n_random_rows(data_array, 3, replace_sampling=False)
print(sample1)
print()
print("Using reusable function (with replacement):")
sample2 = get_n_random_rows(data_array, 4, replace_sampling=True)
print(sample2)
Output:
Using reusable function (without replacement):
[[70 80 90]
 [40 50 60]
 [10 20 30]]
Using reusable function (with replacement):
[[10 20 30]
 [10 20 30]
 [44 55 66]
 [70 80 90]]
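When samples need to be reproducible (for example in tests), a seeded generator can help. The sketch below is a variation rather than the function above: it assumes NumPy 1.17 or later and uses np.random.default_rng() with Generator.choice(), which can sample rows directly via axis=0. The name sample_rows and its seed parameter are hypothetical.

```python
import numpy as np

def sample_rows(input_array, num_rows, replace=False, seed=None):
    """Hypothetical helper: draw rows using a (optionally seeded) Generator."""
    rng = np.random.default_rng(seed)
    # axis=0 tells choice() to sample whole rows instead of single elements.
    return rng.choice(input_array, size=num_rows, replace=replace, axis=0)

data_array = np.array([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
    [11, 22, 33],
    [44, 55, 66],
])

first = sample_rows(data_array, 3, seed=0)
second = sample_rows(data_array, 3, seed=0)

print(first)
print("Same seed, same sample:", np.array_equal(first, second))
```

Passing seed=None (the default) draws fresh entropy on each call, so unseeded calls still vary between runs.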
Conclusion
NumPy offers several effective methods for selecting N random rows from a 2D array:
- For selection with replacement (rows can be repeated): Use numpy.random.randint() to generate row indices and then use these indices to index the array.
- For selection without replacement (each row is unique):
  - numpy.random.choice(total_rows, size=N, replace=False) is the most direct and recommended method for generating unique random row indices.
  - numpy.random.shuffle() on a copy of the array followed by taking the first N rows is also effective but involves an in-place modification (on the copy).
Choosing the appropriate method depends on whether you need unique rows (without replacement) or if repetitions are allowed (with replacement). The np.random.choice method generally provides the most flexibility and clarity for these tasks.