Python NumPy: How to Shuffle Two Arrays Together (In Unison)
When working with paired datasets in NumPy, such as features and corresponding labels, it's often necessary to shuffle them randomly while maintaining their row-wise correspondence. This "in unison" shuffling ensures that if a row is moved in one array, the corresponding row in the other array is moved to the same new position. This is crucial for tasks like creating random train/test splits or for data augmentation.
This guide demonstrates several methods to shuffle two (or more) NumPy arrays in unison: using numpy.random.permutation() to generate a common shuffled index, using numpy.random.shuffle() on an index array, and using the shuffle utility from sklearn.utils as a convenient, dedicated solution.
The Goal: Synchronized Shuffling of Corresponding Rows
Imagine you have two NumPy arrays, features_array and labels_array, where features_array[i] corresponds to labels_array[i]. If we shuffle them independently, this correspondence will be lost. Shuffling in unison means applying the same random permutation of row order to both arrays.
Let's define two sample NumPy arrays:
import numpy as np
# Array 1: Could represent features (e.g., 2 features per sample)
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
# Array 2: Could represent corresponding labels or other features
array_Y = np.array(['A', 'B', 'C', 'D', 'E']) # Labels for samples 0-4
print("Original array_X:")
print(array_X)
print("Original array_Y:")
print(array_Y)
Output:
Original array_X:
[[ 10 100]
[ 20 200]
[ 30 300]
[ 40 400]
[ 50 500]]
Original array_Y:
['A' 'B' 'C' 'D' 'E']
Our goal is to shuffle the rows of array_X and array_Y such that if row i of array_X moves to position j, then element i of array_Y also moves to position j.
Method 1: Using numpy.random.permutation() (Recommended NumPy Approach)
numpy.random.permutation(x) returns a randomly permuted copy of a sequence x or, if x is an integer, a permuted range of the integers from 0 to x-1. We use the latter to generate a shuffled sequence of indices.
Generating a Permuted Index Array
We generate a permuted array of indices from 0 to N-1, where N is the number of rows (which must be the same for both arrays).
import numpy as np
# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
# Ensure both arrays have the same number of rows (first dimension)
assert array_X.shape[0] == array_Y.shape[0], "Arrays must have the same number of rows to shuffle in unison."
num_rows = array_X.shape[0]
# Generate a permuted sequence of indices from 0 to num_rows-1
permuted_indices = np.random.permutation(num_rows)
print(f"Permuted indices: {permuted_indices}")
Output:
Permuted indices: [2 0 3 1 4]
Applying the Permuted Index to Both Arrays
Use this single permuted_indices array to reorder the rows of both array_X and array_Y. This is a form of "fancy indexing."
import numpy as np
# array_X, array_Y, and permuted_indices defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
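num_rows = array_X.shape[0]
# A fresh permutation is drawn here, so the order differs from the indices printed earlier
permuted_indices = np.random.permutation(num_rows)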
# Apply the same permuted indices to both arrays
shuffled_array_X = array_X[permuted_indices]
shuffled_array_Y = array_Y[permuted_indices]
print("Shuffled array_X (using permuted indices):")
print(shuffled_array_X)
print()
print("Shuffled array_Y (using the SAME permuted indices):")
print(shuffled_array_Y)
Output:
Shuffled array_X (using permuted indices):
[[ 10 100]
[ 20 200]
[ 30 300]
[ 50 500]
[ 40 400]]
Shuffled array_Y (using the SAME permuted indices):
['A' 'B' 'C' 'E' 'D']
The rows are shuffled, but the i-th row in shuffled_array_X still corresponds to the i-th element in shuffled_array_Y according to their original pairing.
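The claim that the pairing survives the shuffle can be checked programmatically. The following is a minimal sketch, not part of the original walkthrough: it assumes the first column of the sample data is unique per row (true for the arrays in this guide) and uses it as a key to look up each row's original label. The names original_label and pairing_intact are illustrative only.

```python
import numpy as np

X = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])
y = np.array(['A', 'B', 'C', 'D', 'E'])

# Map each row's first-column value to its original label.
# Assumption: first-column values are unique, as in this sample data.
original_label = {int(row[0]): str(label) for row, label in zip(X, y)}

# Shuffle both arrays with one shared permutation.
perm = np.random.permutation(len(X))
X_shuf, y_shuf = X[perm], y[perm]

# Every shuffled row should still sit next to its original label.
pairing_intact = all(
    original_label[int(row[0])] == str(label)
    for row, label in zip(X_shuf, y_shuf)
)
print(pairing_intact)  # True
```

A check like this is cheap to run once after writing a shuffling routine, especially when the arrays come from different sources.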
Creating a Reusable Shuffling Function
import numpy as np
def shuffle_arrays_in_unison(arr1, arr2):
    """Shuffles two NumPy arrays along their first axis in unison."""
    if arr1.shape[0] != arr2.shape[0]:
        raise ValueError("Arrays must have the same number of rows (first dimension).")
    permutation = np.random.permutation(arr1.shape[0])
    return arr1[permutation], arr2[permutation]
# Example usage:
# array_X, array_Y defined above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
shuffled_X, shuffled_Y = shuffle_arrays_in_unison(array_X, array_Y)
print("--- Using Reusable Function (permutation) ---")
print("Shuffled X:\n", shuffled_X)
print("Shuffled Y:\n", shuffled_Y)
Output:
--- Using Reusable Function (permutation) ---
Shuffled X:
[[ 50 500]
[ 30 300]
[ 40 400]
[ 20 200]
[ 10 100]]
Shuffled Y:
['E' 'C' 'D' 'B' 'A']
This function can be easily extended to shuffle more than two arrays by applying the same permutation to all of them.
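One way such an extension might look is sketched below. The function name shuffle_in_unison and the extra sample_weights array are hypothetical, introduced only for illustration. The function accepts any number of arrays and returns shuffled copies, leaving the inputs unchanged.

```python
import numpy as np

def shuffle_in_unison(*arrays):
    """Return shuffled copies of all arrays, reordered by one shared permutation."""
    if not arrays:
        raise ValueError("At least one array is required.")
    n = arrays[0].shape[0]
    if any(a.shape[0] != n for a in arrays):
        raise ValueError("All arrays must have the same length along the first axis.")
    perm = np.random.permutation(n)  # one permutation, reused for every array
    return tuple(a[perm] for a in arrays)

X = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])
y = np.array(['A', 'B', 'C', 'D', 'E'])
sample_weights = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # hypothetical third array

X_s, y_s, w_s = shuffle_in_unison(X, y, sample_weights)
print(X_s)
print(y_s)
print(w_s)
```

Returning a tuple keeps the call site symmetrical with the two-array version: unpack as many results as arrays were passed in.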
Method 2: Using numpy.random.shuffle() on an Index Array
numpy.random.shuffle(x) shuffles a sequence x in-place. We can create an array of indices, shuffle it in-place, and then use these shuffled indices to reorder both arrays.
Creating and Shuffling an Index Array
import numpy as np
# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
num_rows = array_X.shape[0]
# Create an array of indices [0, 1, 2, ..., num_rows-1]
indices_to_shuffle = np.arange(num_rows)
print(f"Original indices: {indices_to_shuffle}")
# Shuffle the indices in-place
np.random.shuffle(indices_to_shuffle)
print(f"Shuffled indices (in-place): {indices_to_shuffle}")
Output:
Original indices: [0 1 2 3 4]
Shuffled indices (in-place): [4 1 2 0 3]
Applying the Shuffled Index
This step is identical to "Applying the Permuted Index to Both Arrays" in Method 1, using indices_to_shuffle in place of permuted_indices.
import numpy as np
# array_X, array_Y, and in-place shuffled indices_to_shuffle defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
num_rows = array_X.shape[0]
indices_to_shuffle = np.arange(num_rows)
np.random.shuffle(indices_to_shuffle)  # shuffle the indices in-place before indexing
shuffled_array_X_v2 = array_X[indices_to_shuffle]
shuffled_array_Y_v2 = array_Y[indices_to_shuffle]
print("--- Using np.random.shuffle() on indices ---")
print("Shuffled array_X (v2):\n", shuffled_array_X_v2)
print("Shuffled array_Y (v2):\n", shuffled_array_Y_v2)
Output (one possible result, shown here for shuffled indices [4 1 2 0 3]; it varies between runs):
--- Using np.random.shuffle() on indices ---
Shuffled array_X (v2):
[[ 50 500]
[ 20 200]
[ 30 300]
[ 10 100]
[ 40 400]]
Shuffled array_Y (v2):
['E' 'B' 'C' 'A' 'D']
The result is the same kind of synchronized shuffle as with permutation. The main difference is that shuffle modifies its argument in-place, while permutation returns a new permuted array.
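The in-place versus copy distinction can be seen directly in a short sketch. The variable names below are illustrative; the behavior shown (permutation leaves its input untouched, shuffle returns None) is standard for these two NumPy functions.

```python
import numpy as np

indices = np.arange(5)

# permutation() returns a new shuffled array and leaves its input untouched.
permuted = np.random.permutation(indices)
unchanged = np.array_equal(indices, np.arange(5))
print(unchanged)  # True

# shuffle() reorders its argument in-place and returns None.
result = np.random.shuffle(indices)
print(result)  # None
print(indices)  # now in shuffled order
```

A common mistake follows from this: writing indices = np.random.shuffle(indices) replaces the index array with None.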
Method 3: Using sklearn.utils.shuffle (Convenient for Machine Learning Contexts)
If you are already using Scikit-learn, its shuffle utility is designed for exactly this purpose and can handle multiple arrays.
Installation (if needed)
If you don't have Scikit-learn installed:
pip install scikit-learn
# Or for conda:
conda install scikit-learn
Applying sklearn.utils.shuffle
import numpy as np
from sklearn.utils import shuffle # Import the shuffle utility
# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
# Shuffle array_X and array_Y in unison
# random_state ensures reproducibility; omit for different shuffle each time
shuffled_X_sklearn, shuffled_Y_sklearn = shuffle(array_X, array_Y, random_state=42)
print("--- Using sklearn.utils.shuffle ---")
print("Shuffled array_X (sklearn):")
print(shuffled_X_sklearn)
print("Shuffled array_Y (sklearn):")
print(shuffled_Y_sklearn)
Output:
--- Using sklearn.utils.shuffle ---
Shuffled array_X (sklearn):
[[ 20 200]
[ 50 500]
[ 30 300]
[ 10 100]
[ 40 400]]
Shuffled array_Y (sklearn):
['B' 'E' 'C' 'A' 'D']
The random_state parameter is useful for making your shuffles reproducible. If omitted, the shuffle will be different each time the code is run. You can pass more than two arrays to sklearn.utils.shuffle.
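A similar reproducibility guarantee is available in plain NumPy by seeding a random generator. The sketch below assumes NumPy 1.17 or newer, which provides np.random.default_rng; the helper name seeded_unison_shuffle is hypothetical. Note that a given seed does not produce the same ordering as the same random_state value in sklearn.utils.shuffle, because the underlying generators differ.

```python
import numpy as np

X = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])
y = np.array(['A', 'B', 'C', 'D', 'E'])

def seeded_unison_shuffle(a, b, seed):
    """Shuffle two arrays in unison using a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(a.shape[0])
    return a[perm], b[perm]

# The same seed yields the same shuffle on every call.
X1, y1 = seeded_unison_shuffle(X, y, seed=42)
X2, y2 = seeded_unison_shuffle(X, y, seed=42)
same = np.array_equal(X1, X2) and np.array_equal(y1, y2)
print(same)  # True
```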
Choosing the Best Method
- numpy.random.permutation(len(array)) (Method 1): This is generally the most idiomatic and direct NumPy approach for generating shuffled indices to apply to multiple arrays. It's clear and efficient, and is often recommended.
- np.arange() then np.random.shuffle() (Method 2): Also a valid NumPy approach. It's slightly more verbose because the in-place shuffle requires creating a separate index array first.
- sklearn.utils.shuffle() (Method 3): Very convenient if you are already working within the Scikit-learn ecosystem or prefer its direct syntax for shuffling multiple arrays together. The random_state parameter for reproducibility is a nice built-in feature.
All three methods effectively achieve shuffling in unison. The choice often comes down to personal preference or the context of your existing codebase.
Conclusion
Shuffling two or more NumPy arrays in unison is essential for maintaining the correspondence between related data points while randomizing their order.
- The most common NumPy-native way is to generate a permuted set of indices using np.random.permutation(N) and then use these same indices to reorder each array.
- Alternatively, create an array of indices with np.arange(N), shuffle it in-place with np.random.shuffle(), and then use these shuffled indices.
- For users of Scikit-learn, sklearn.utils.shuffle(arr1, arr2, ...) provides a dedicated and convenient function for this task.
By employing these techniques, you can confidently shuffle your paired NumPy arrays while preserving their critical row-wise relationships.