Python NumPy: How to Get Column Names from a Structured Array (and Plain Arrays)

NumPy arrays, particularly 2D arrays, are often used to represent tabular data. Standard NumPy arrays don't have named columns the way Pandas DataFrames do, but NumPy offers structured arrays (closely related to record arrays, np.recarray), which let you assign names and data types to individual "fields" or columns. When working with structured arrays, especially those loaded from text files with headers, accessing these column names is a common requirement.

This guide shows how to retrieve column names from NumPy structured arrays using the dtype.names attribute, how to add names to a plain (unstructured) NumPy array by converting it to a structured array, and, for context, how column names are accessed in Pandas DataFrames.

Understanding Column Names in NumPy: Structured vs. Plain Arrays

  • Plain NumPy ndarrays (e.g., np.array([[1,2],[3,4]])): These are homogeneous arrays where all elements are of the same data type. They do not have built-in named columns. You access elements by numerical indices (e.g., arr[0, 1]).
  • NumPy Structured Arrays (and the closely related record arrays): Each element is a record made up of named "fields", and each field can have its own data type. The field names act like column names, which is useful for representing heterogeneous tabular data.

The methods for getting "column names" primarily apply to structured arrays.
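The difference between the two kinds of array shows up directly in dtype.names. The following is a minimal sketch; the field names 'id' and 'score' are illustrative and do not come from the examples in this guide.

```python
import numpy as np

# A plain 2D array: one dtype for every element, so there are no field names.
plain = np.array([[1, 2], [3, 4]])
print(plain.dtype.names)  # None

# A structured array: each field has its own name and dtype.
# The field names 'id' and 'score' are illustrative.
structured = np.array(
    [(1, 9.5), (2, 7.25)],
    dtype=[('id', 'i4'), ('score', 'f8')],
)
print(structured.dtype.names)  # ('id', 'score')
```

Checking whether dtype.names is None is a simple way to tell if an array has named fields at all.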

Getting Column Names from a NumPy Structured Array (dtype.names)

When you load data into a NumPy array in a way that creates a structured array (e.g., using np.genfromtxt with names=True or by defining a structured dtype), the names of these fields (columns) are accessible via the dtype.names attribute.

Creating a Structured Array from a File (np.genfromtxt)

Let's assume we have a text file employee_data.txt with a header row:

employee_data.txt:

employee_id,department,years_of_service,salary
E101,HR,5,70000
E102,IT,10,85000
E103,Sales,3,65000

We can load this using numpy.genfromtxt and tell it to use the first row as names:

import numpy as np

file_path = 'employee_data.txt'

# Load data, using the first line as field names (names=True)
# dtype=None tells NumPy to infer the types for each column
try:
    structured_data_array = np.genfromtxt(
        file_path,
        names=True,        # Use the first row as column names
        delimiter=',',
        dtype=None,        # Infer dtypes for each column
        encoding='utf-8'   # Specify encoding
    )
    print("Loaded Structured NumPy Array (first 2 records):")
    print(structured_data_array[:2])

except OSError:  # Includes FileNotFoundError
    print(f"File not found: {file_path}. Please create it with the content above.")
    structured_data_array = None  # For script to continue

Output:

Loaded Structured NumPy Array (first 2 records):
[('E101', 'HR', 5, 70000) ('E102', 'IT', 10, 85000)]
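
If you want to try this without creating a file, np.genfromtxt also accepts a file-like object. The sketch below assumes the same rows as employee_data.txt, held in a string; the variable names csv_text and records are illustrative.

```python
import io
import numpy as np

# Same rows as employee_data.txt, held in memory for convenience.
csv_text = """employee_id,department,years_of_service,salary
E101,HR,5,70000
E102,IT,10,85000
E103,Sales,3,65000
"""

records = np.genfromtxt(
    io.StringIO(csv_text),  # File-like object instead of a path
    names=True,             # First row supplies the field names
    delimiter=',',
    dtype=None,             # Infer a dtype per column
    encoding='utf-8',
)
print(records.dtype.names)
print(records['salary'])
```

The result is a 1D structured array with one record per data row, so records.shape is (3,) rather than (3, 4).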

Accessing Column Names with array.dtype.names

Once you have a structured array, the dtype.names attribute returns a tuple of the field (column) names.

import numpy as np

# Assume structured_data_array is loaded as above
file_path = 'employee_data.txt'
structured_data_array = np.genfromtxt(
    file_path,
    names=True,        # Use the first row as column names
    delimiter=',',
    dtype=None,        # Infer dtypes for each column
    encoding='utf-8'   # Specify encoding
)

if structured_data_array is not None:
    column_names_tuple = structured_data_array.dtype.names
    print("Column names (field names) from the structured array:")
    print(column_names_tuple)
else:
    print("Skipping dtype.names example as data was not loaded.")

Output:

Column names (field names) from the structured array:
('employee_id', 'department', 'years_of_service', 'salary')
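
Alongside the names, it is often useful to see each field's data type or to test whether a given column exists. The sketch below builds a small structured array by hand (the explicit dtype strings are assumptions chosen for illustration, not what genfromtxt would necessarily infer) and indexes the dtype by field name.

```python
import numpy as np

# Illustrative structured array; field names mirror the employee file above.
records = np.array(
    [('E101', 'HR', 5, 70000), ('E102', 'IT', 10, 85000)],
    dtype=[('employee_id', 'U4'), ('department', 'U5'),
           ('years_of_service', 'i8'), ('salary', 'i8')],
)

# dtype.names is a tuple; convert it when a mutable list is needed.
names = list(records.dtype.names)

# Indexing the dtype by field name returns that field's dtype.
for name in names:
    print(name, records.dtype[name])

# Membership test: does the array have a 'salary' field?
has_salary = 'salary' in records.dtype.names
print(has_salary)
```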

Accessing Individual Column Names

Since dtype.names returns a tuple, you can access individual names by their index.

import numpy as np

# Assume structured_data_array is loaded as above
file_path = 'employee_data.txt'
structured_data_array = np.genfromtxt(
    file_path,
    names=True,        # Use the first row as column names
    delimiter=',',
    dtype=None,        # Infer dtypes for each column
    encoding='utf-8'   # Specify encoding
)

if structured_data_array is not None:
    column_names_tuple = structured_data_array.dtype.names
    print(f"First column name: {column_names_tuple[0]}")
    print(f"Second column name: {column_names_tuple[1]}")
    # And so on...
else:
    print("Skipping individual column name example.")

Output:

First column name: employee_id
Second column name: department
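
Because the names form an ordinary tuple, you can also go from position to name to column data, or from name back to position. This is a small sketch using a hand-built array; the two-field layout is an assumption made to keep the example short.

```python
import numpy as np

records = np.array(
    [('E101', 5), ('E102', 10), ('E103', 3)],
    dtype=[('employee_id', 'U4'), ('years_of_service', 'i8')],
)

# Pair each position with its field name, then use the name to pull the column.
for position, name in enumerate(records.dtype.names):
    print(position, name, records[name])

# Look up a name's position; tuple.index raises ValueError if the name is missing.
position = records.dtype.names.index('years_of_service')
print(position)
```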

Adding Column Names to a Plain (Unstructured) NumPy Array

If you have a regular, unstructured NumPy array (e.g., created with np.array([[1,2],[3,4]])) and you want to associate column names with it for more descriptive access, you need to convert it into a structured array. The numpy.lib.recfunctions.unstructured_to_structured() function is designed for this.

Using numpy.lib.recfunctions.unstructured_to_structured

import numpy as np
import numpy.lib.recfunctions as rfn # recfunctions for structured array tools

# Mixed int/str/float values make NumPy store every element as a string
plain_array = np.array([
    [101, 'Alice', 75000.50],
    [102, 'Bob', 82000.75],
    [103, 'Charlie', 68000.00]
])
print("Original plain NumPy array:")
print(plain_array)
print()

# Define the desired column names and their corresponding data types
# This creates a structured dtype
new_dtype = np.dtype([
    ('EmployeeID', int),
    ('Name', 'U10'),  # Unicode string of max length 10
    ('Salary', float)
])

# Convert the plain array to a structured array with named fields
array_with_named_cols = rfn.unstructured_to_structured(plain_array, dtype=new_dtype)

print("Array after converting to structured array with named columns:")
print(array_with_named_cols)
print()

# Now you can access its column names
print("Column names of the new structured array:")
print(array_with_named_cols.dtype.names)

Output:

Original plain NumPy array:
[['101' 'Alice' '75000.5']
 ['102' 'Bob' '82000.75']
 ['103' 'Charlie' '68000.0']]

Array after converting to structured array with named columns:
[(101, 'Alice', 75000.5 ) (102, 'Bob', 82000.75)
(103, 'Charlie', 68000. )]

Column names of the new structured array:
('EmployeeID', 'Name', 'Salary')
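
When every column shares the same data type, a full dtype definition is not needed: unstructured_to_structured also takes a names argument, and each field then keeps the source array's dtype. The sketch below assumes a purely numeric array; the names 'x' and 'y' are illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# A homogeneous float array: every column shares one dtype.
measurements = np.array([[1.0, 2.5], [3.0, 4.5]])

# names= assigns field names; each field keeps the source array's dtype.
named = rfn.unstructured_to_structured(measurements, names=['x', 'y'])

print(named.dtype.names)
print(named['y'])
```

Passing dtype= is still the better choice when the columns should end up with different types, as in the employee example above.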

Accessing Data by New Column Names

Once converted to a structured array, you can access data by these new field names.

import numpy as np
import numpy.lib.recfunctions as rfn # recfunctions for structured array tools

# array_with_named_cols defined as above
plain_array = np.array([
    [101, 'Alice', 75000.50],
    [102, 'Bob', 82000.75],
    [103, 'Charlie', 68000.00]
])
new_dtype = np.dtype([
    ('EmployeeID', int),
    ('Name', 'U10'),  # Unicode string of max length 10
    ('Salary', float)
])
array_with_named_cols = rfn.unstructured_to_structured(plain_array, dtype=new_dtype)

if 'array_with_named_cols' in locals():  # Check if variable exists
    print("Accessing data by new column names:")
    print("Employee IDs:", array_with_named_cols['EmployeeID'])
    print("Salaries:", array_with_named_cols['Salary'])
else:
    print("Skipping data access example as array_with_named_cols was not created.")

Output:

Accessing data by new column names:
Employee IDs: [101 102 103]
Salaries: [75000.5 82000.75 68000. ]
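
Field names can also drive row filtering and multi-column selection. The following sketch builds the structured array directly from tuples (an assumption made to keep it self-contained, rather than converting a plain array first); the variable names staff, well_paid, and subset are illustrative.

```python
import numpy as np

staff = np.array(
    [(101, 'Alice', 75000.50), (102, 'Bob', 82000.75), (103, 'Charlie', 68000.00)],
    dtype=[('EmployeeID', 'i8'), ('Name', 'U10'), ('Salary', 'f8')],
)

# Boolean mask built from one field, applied to whole records.
well_paid = staff[staff['Salary'] > 70000]
print(well_paid['Name'])

# A list of field names selects several columns at once.
subset = staff[['EmployeeID', 'Salary']]
print(subset.dtype.names)
```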

Context: Getting Column Names from a Pandas DataFrame (for Comparison)

It's useful to contrast this with how column names are handled in Pandas, as many users work with both libraries. Pandas DataFrames inherently have named columns.

Assume employee_data.csv has the same content as employee_data.txt but with a .csv extension.

import pandas as pd

file_path_csv = 'employee_data.csv' # Assumed to exist

try:
    df = pd.read_csv(file_path_csv)
    print("Pandas DataFrame (first 2 rows):")
    print(df.head(2))
    print()

    # Get column names from a Pandas DataFrame
    pandas_column_names = df.columns
    print("Column names from Pandas DataFrame (Index object):")
    print(pandas_column_names)
    print()

    # To get as a Python list:
    pandas_column_names_list = df.columns.tolist()
    print("Column names as a Python list:")
    print(pandas_column_names_list)
    print()

except FileNotFoundError:
    print(f"Pandas example skipped: File not found - {file_path_csv}")

Output:

Pandas DataFrame (first 2 rows):
   employee_id  department  years_of_service  salary
0         E101          HR                 5   70000
1         E102          IT                10   85000

Column names from Pandas DataFrame (Index object):
Index(['employee_id', 'department', 'years_of_service', 'salary'], dtype='object')

Column names as a Python list:
['employee_id', 'department', 'years_of_service', 'salary']

Note: In Pandas, df.columns returns an Index object containing the column labels. df.columns.tolist() converts this to a standard Python list.

Conclusion

Getting "column names" from a NumPy array depends on its type:

  • For NumPy structured arrays (often created by functions like np.genfromtxt(names=True) or by defining a dtype with named fields), the column names (or field names) are directly accessible as a tuple via the array.dtype.names attribute.
  • Plain (unstructured) NumPy arrays do not have inherent column names. To associate names with their columns, first convert the plain array into a structured array, for example with numpy.lib.recfunctions.unstructured_to_structured() and a new dtype that defines the field names and types.

This is distinct from Pandas DataFrames, where column names are a fundamental attribute accessible via df.columns. Understanding this distinction is key when working with tabular data in NumPy.