Python NumPy: How to Shuffle Two Arrays Together (In Unison)

When working with paired datasets in NumPy, such as features and corresponding labels, it's often necessary to shuffle them randomly while maintaining their row-wise correspondence. This "in unison" shuffling ensures that if a row is moved in one array, the corresponding row in the other array is moved to the same new position. This is crucial for tasks like creating random train/test splits or for data augmentation.

This guide demonstrates three ways to shuffle two (or more) NumPy arrays in unison: generating a shared shuffled index with numpy.random.permutation(), shuffling an index array in place with numpy.random.shuffle(), and using the shuffle utility from sklearn.utils as a convenient, dedicated solution.

The Goal: Synchronized Shuffling of Corresponding Rows

Imagine you have two NumPy arrays, features_array and labels_array, where features_array[i] corresponds to labels_array[i]. If we shuffle them independently, this correspondence will be lost. Shuffling in unison means applying the same random permutation of row order to both arrays.

Let's define two sample NumPy arrays:

import numpy as np

# Array 1: Could represent features (e.g., 2 features per sample)
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])

# Array 2: Could represent corresponding labels or other features
array_Y = np.array(['A', 'B', 'C', 'D', 'E']) # Labels for samples 0-4

print("Original array_X:")
print(array_X)
print("Original array_Y:")
print(array_Y)

Output:

Original array_X:
[[ 10 100]
[ 20 200]
[ 30 300]
[ 40 400]
[ 50 500]]
Original array_Y:
['A' 'B' 'C' 'D' 'E']

Our goal is to shuffle the rows of array_X and array_Y such that if row i of array_X moves to position j, then row i of array_Y also moves to position j.

Method 1: Using numpy.random.permutation()

numpy.random.permutation(x) returns a randomly permuted copy of a sequence x or, if x is an integer, a randomly permuted np.arange(x). We use the integer form to generate a shuffled sequence of row indices.
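The two calling conventions can be compared side by side. The following is a minimal sketch; the variable names are illustrative only, and the printed order will differ between runs.

```python
import numpy as np

# Integer argument: returns a shuffled copy of np.arange(5)
index_perm = np.random.permutation(5)
print(index_perm)

# Array argument: returns a shuffled copy; the input is left untouched
values = np.array([10, 20, 30, 40, 50])
values_perm = np.random.permutation(values)
print(values_perm)
print(values)  # still in its original order
```

For shuffling in unison, the integer form is the useful one, because the resulting index array can be applied to any number of arrays.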

Generating a Permuted Index Array

We generate a permuted list of indices from 0 to N-1, where N is the number of rows (which must be the same for both arrays).

import numpy as np

# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

# Ensure both arrays have the same number of rows (first dimension)
assert array_X.shape[0] == array_Y.shape[0], "Arrays must have the same number of rows to shuffle in unison."
num_rows = array_X.shape[0]

# Generate a permuted sequence of indices from 0 to num_rows-1
permuted_indices = np.random.permutation(num_rows)
print(f"Permuted indices: {permuted_indices}")

Output:

Permuted indices: [2 0 3 1 4]

Applying the Permuted Index to Both Arrays

Use this single permuted_indices array to reorder the rows of both array_X and array_Y. This is a form of "fancy indexing."

import numpy as np

# array_X, array_Y, and permuted_indices defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
num_rows = array_X.shape[0]
permuted_indices = np.random.permutation(num_rows)

# Apply the same permuted indices to both arrays
shuffled_array_X = array_X[permuted_indices]
shuffled_array_Y = array_Y[permuted_indices]

print("Shuffled array_X (using permuted indices):")
print(shuffled_array_X)
print()

print("Shuffled array_Y (using the SAME permuted indices):")
print(shuffled_array_Y)

Output (the order will vary from run to run, because this snippet generates a fresh permutation):

Shuffled array_X (using permuted indices):
[[ 10 100]
[ 20 200]
[ 30 300]
[ 50 500]
[ 40 400]]

Shuffled array_Y (using the SAME permuted indices):
['A' 'B' 'C' 'E' 'D']
Note: The rows are shuffled, but the i-th row in shuffled_array_X still corresponds to the i-th element in shuffled_array_Y according to their original pairing.
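One way to convince yourself that the pairing survives is to compare the (row, label) pairs before and after the shuffle. The following is a minimal sketch; the helper name pairs_of is made up for illustration and is not part of NumPy.

```python
import numpy as np

array_X = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

def pairs_of(features, labels):
    """Return the set of (row-as-tuple, label) pairs (hypothetical helper)."""
    return {(tuple(row.tolist()), str(label)) for row, label in zip(features, labels)}

permuted_indices = np.random.permutation(array_X.shape[0])
shuffled_array_X = array_X[permuted_indices]
shuffled_array_Y = array_Y[permuted_indices]

# The same pairs are present, only their order may differ
print(pairs_of(array_X, array_Y) == pairs_of(shuffled_array_X, shuffled_array_Y))  # True

# Indexing with an index array returns copies, so the originals are unchanged
print(array_Y)  # ['A' 'B' 'C' 'D' 'E']
```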

Creating a Reusable Shuffling Function

import numpy as np

def shuffle_arrays_in_unison(arr1, arr2):
    """Shuffles two NumPy arrays along their first axis in unison."""
    if arr1.shape[0] != arr2.shape[0]:
        raise ValueError("Arrays must have the same number of rows (first dimension).")

    permutation = np.random.permutation(arr1.shape[0])
    return arr1[permutation], arr2[permutation]

# Example usage:
# array_X, array_Y defined above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

shuffled_X, shuffled_Y = shuffle_arrays_in_unison(array_X, array_Y)
print("--- Using Reusable Function (permutation) ---")
print("Shuffled X:\n", shuffled_X)
print("Shuffled Y:\n", shuffled_Y)

Output:

--- Using Reusable Function (permutation) ---
Shuffled X:
[[ 50 500]
[ 30 300]
[ 40 400]
[ 20 200]
[ 10 100]]
Shuffled Y:
['E' 'C' 'D' 'B' 'A']

This function can be easily extended to shuffle more than two arrays by applying the same permutation to all of them.
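As a rough sketch of that extension, the function below accepts any number of arrays and applies one shared permutation to all of them. The name shuffle_many_in_unison and the sample arrays (features, labels, weights) are assumptions made for this example.

```python
import numpy as np

def shuffle_many_in_unison(*arrays):
    """Shuffle any number of arrays along axis 0 with one shared permutation."""
    if not arrays:
        raise ValueError("At least one array is required.")
    num_rows = arrays[0].shape[0]
    if any(arr.shape[0] != num_rows for arr in arrays):
        raise ValueError("All arrays must have the same number of rows (first dimension).")

    # One permutation, reused for every array, keeps the rows aligned
    permutation = np.random.permutation(num_rows)
    return tuple(arr[permutation] for arr in arrays)

# Example usage with three aligned arrays (illustrative data)
features = np.array([[10, 100], [20, 200], [30, 300]])
labels = np.array(['A', 'B', 'C'])
weights = np.array([1.0, 2.0, 3.0])

shuffled_features, shuffled_labels, shuffled_weights = shuffle_many_in_unison(
    features, labels, weights
)
print(shuffled_features)
print(shuffled_labels)
print(shuffled_weights)
```

Returning a tuple keeps the call site symmetric with the two-array version: the outputs unpack in the same order as the inputs.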

Method 2: Using numpy.random.shuffle() on an Index Array

numpy.random.shuffle(x) shuffles a sequence x in-place. We can create an array of indices, shuffle it in-place, and then use these shuffled indices.

Creating and Shuffling an Index Array

import numpy as np

# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

num_rows = array_X.shape[0]

# Create an array of indices [0, 1, 2, ..., num_rows-1]
indices_to_shuffle = np.arange(num_rows)
print(f"Original indices: {indices_to_shuffle}")

# Shuffle the indices in-place
np.random.shuffle(indices_to_shuffle)
print(f"Shuffled indices (in-place): {indices_to_shuffle}")

Output:

Original indices: [0 1 2 3 4]
Shuffled indices (in-place): [4 1 2 0 3]

Applying the Shuffled Index

This step is identical to the "Applying the Permuted Index to Both Arrays" step in Method 1, except that it uses indices_to_shuffle.

import numpy as np

# array_X, array_Y, and in-place shuffled indices_to_shuffle defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])
num_rows = array_X.shape[0]
indices_to_shuffle = np.arange(num_rows)
np.random.shuffle(indices_to_shuffle)  # shuffle the indices in-place before using them

shuffled_array_X_v2 = array_X[indices_to_shuffle]
shuffled_array_Y_v2 = array_Y[indices_to_shuffle]

print("--- Using np.random.shuffle() on indices ---")
print("Shuffled array_X (v2):\n", shuffled_array_X_v2)
print("Shuffled array_Y (v2):\n", shuffled_array_Y_v2)

Output (will vary between runs; shown here for shuffled indices [4 1 2 0 3]):

--- Using np.random.shuffle() on indices ---
Shuffled array_X (v2):
[[ 50 500]
[ 20 200]
[ 30 300]
[ 10 100]
[ 40 400]]
Shuffled array_Y (v2):
['E' 'B' 'C' 'A' 'D']

The result is the same kind of synchronized shuffle as with permutation. The main difference is that shuffle modifies its argument in-place, while permutation returns a new permuted array.
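The in-place versus copy distinction can be seen directly in a small sketch (variable names are illustrative). A common slip is to write indices = np.random.shuffle(indices), which replaces the index array with None.

```python
import numpy as np

indices = np.arange(5)

# shuffle() reorders its argument in place and returns None
result = np.random.shuffle(indices)
print(result)   # None
print(indices)  # a reordering of 0..4

# permutation() leaves its argument alone and returns a new array
original = np.arange(5)
permuted = np.random.permutation(original)
print(original)  # [0 1 2 3 4]
print(permuted)  # a reordering of 0..4
```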

Method 3: Using sklearn.utils.shuffle (Convenient for Machine Learning Contexts)

If you are already using Scikit-learn, its shuffle utility is designed for exactly this purpose and can handle multiple arrays.

Installation (if needed)

If you don't have Scikit-learn installed:

pip install scikit-learn
# Or for conda:
conda install scikit-learn

Applying sklearn.utils.shuffle

import numpy as np
from sklearn.utils import shuffle # Import the shuffle utility

# array_X and array_Y defined as above
array_X = np.array([
[10, 100], # Sample 0
[20, 200], # Sample 1
[30, 300], # Sample 2
[40, 400], # Sample 3
[50, 500] # Sample 4
])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

# Shuffle array_X and array_Y in unison
# random_state ensures reproducibility; omit for different shuffle each time
shuffled_X_sklearn, shuffled_Y_sklearn = shuffle(array_X, array_Y, random_state=42)

print("--- Using sklearn.utils.shuffle ---")
print("Shuffled array_X (sklearn):")
print(shuffled_X_sklearn)
print("Shuffled array_Y (sklearn):")
print(shuffled_Y_sklearn)

Output:

--- Using sklearn.utils.shuffle ---
Shuffled array_X (sklearn):
[[ 20 200]
[ 50 500]
[ 30 300]
[ 10 100]
[ 40 400]]
Shuffled array_Y (sklearn):
['B' 'E' 'C' 'A' 'D']
Note: The random_state parameter is useful for making your shuffles reproducible. If omitted, the shuffle will be different each time the code is run. You can pass more than two arrays to sklearn.utils.shuffle.
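If you want similar reproducibility without Scikit-learn, one option is to draw the permutation from a seeded NumPy Generator. This is a sketch under the assumption that your NumPy version provides np.random.default_rng (NumPy 1.17 or later); the function name shuffle_in_unison_seeded and its seed parameter are made up for this example and are not part of either library.

```python
import numpy as np

def shuffle_in_unison_seeded(arr1, arr2, seed=None):
    """Shuffle two arrays together; the same seed gives the same order (hypothetical helper)."""
    if arr1.shape[0] != arr2.shape[0]:
        raise ValueError("Arrays must have the same number of rows (first dimension).")

    # A fresh Generator seeded with `seed`; seed=None gives a different order each run
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(arr1.shape[0])
    return arr1[permutation], arr2[permutation]

array_X = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])
array_Y = np.array(['A', 'B', 'C', 'D', 'E'])

first_X, first_Y = shuffle_in_unison_seeded(array_X, array_Y, seed=42)
second_X, second_Y = shuffle_in_unison_seeded(array_X, array_Y, seed=42)

# Same seed, same shuffle
print(np.array_equal(first_X, second_X) and np.array_equal(first_Y, second_Y))  # True
```

The order produced by a given seed here is not expected to match the order sklearn.utils.shuffle produces for the same random_state value, since the two rely on different random number generators.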

Choosing the Best Method

  • numpy.random.permutation(len(array)) (Method 1): This is generally the most idiomatic and direct NumPy approach for generating shuffled indices to apply to multiple arrays. It is clear, efficient, and often the recommended choice.
  • np.arange() then np.random.shuffle() (Method 2): Also a valid NumPy approach. It's slightly more verbose due to the in-place shuffle requiring a separate index array creation.
  • sklearn.utils.shuffle() (Method 3): Very convenient if you are already working within the Scikit-learn ecosystem or prefer its direct syntax for shuffling multiple arrays together. The random_state for reproducibility is a nice built-in feature.

All three methods effectively achieve shuffling in unison. The choice often comes down to personal preference or the context of your existing codebase.

Conclusion

Shuffling two or more NumPy arrays in unison is essential for maintaining the correspondence between related data points while randomizing their order.

  1. The most common NumPy-native way is to generate a permuted set of indices using np.random.permutation(N) and then use these same indices to reorder each array.
  2. Alternatively, create an array of indices np.arange(N), shuffle it in-place with np.random.shuffle(), and then use these shuffled indices.
  3. For users of Scikit-learn, sklearn.utils.shuffle(arr1, arr2, ...) provides a dedicated and convenient function for this task.

By employing these techniques, you can confidently shuffle your paired NumPy arrays while preserving their critical row-wise relationships.