Python Pandas: Remove Rows with Non-Numeric Values in a Column
Data cleaning often involves ensuring that columns expected to contain numerical data actually do. Non-numeric entries (like strings or error codes mixed with numbers) can cause issues during calculations, plotting, or model training.
This guide explains how to identify and remove rows from a Pandas DataFrame where a specific column contains non-numeric values, primarily using pd.to_numeric() and other type-checking methods.
The Goal: Ensuring Numeric Data in a Column
Given a Pandas DataFrame, we want to examine a specific column that should contain numbers. If any row in this column has a value that cannot be interpreted as a number (e.g., a string like "N/A", "Error", or an accidental text entry), we want to remove that entire row from the DataFrame.
Example DataFrame with Mixed Data Types
import pandas as pd
import numpy as np # For np.nan and numeric types
data = {
'RecordID': [1, 2, 3, 4, 5, 6, 7],
'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'],
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
'Notes': ['OK', 'Sensor fault', 'OK', 'Low reading', 'OK', 'Missing', 'Not applicable']
}
df_original = pd.DataFrame(data)
print("Original DataFrame:")
print(df_original)
print()
print("Original dtypes:")
print(df_original.dtypes)
Output:
Original DataFrame:
RecordID SensorValue Category Notes
0 1 10.5 A OK
1 2 Error B Sensor fault
2 3 12.3 A OK
3 4 -5.0 C Low reading
4 5 15 B OK
5 6 NaN A Missing
6 7 N/A C Not applicable
Original dtypes:
RecordID int64
SensorValue object
Category object
Notes object
dtype: object
Our goal is to filter df_original to keep only rows where 'SensorValue' is (or can be converted to) a valid number.
Method 1: Using pd.to_numeric() with errors='coerce' (Recommended)
This is the most robust and idiomatic Pandas way to handle this.
- Attempt to convert the target column to a numeric type using pd.to_numeric().
- Set errors='coerce'. This replaces any value that cannot be converted to a number with NaN (Not a Number).
- Use Series.notnull() on the converted column to create a boolean mask (equivalently, negate Series.isna()). notnull() is True for numbers and False for NaN.
- Use this mask to filter the original DataFrame.
How It Works
import pandas as pd
import numpy as np
df = pd.DataFrame({
'RecordID': [1, 2, 3, 4, 5, 6, 7],
'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'],
})
# Step 1 & 2: Convert to numeric, coercing errors to NaN
numeric_sensor_values = pd.to_numeric(df['SensorValue'], errors='coerce')
print("'SensorValue' after pd.to_numeric(errors='coerce'):")
print(numeric_sensor_values)
print()
# Step 3: Create boolean mask (True for non-NaN/numeric values)
is_numeric_mask = numeric_sensor_values.notnull()
print("Boolean mask (is_numeric_mask):")
print(is_numeric_mask)
Output:
'SensorValue' after pd.to_numeric(errors='coerce'):
0 10.5
1 NaN
2 12.3
3 -5.0
4 15.0
5 NaN
6 NaN
Name: SensorValue, dtype: float64
Boolean mask (is_numeric_mask):
0 True
1 False
2 True
3 True
4 True
5 False
6 False
Name: SensorValue, dtype: bool
Implementation
import pandas as pd
import numpy as np
df = pd.DataFrame({
'RecordID': [1, 2, 3, 4, 5, 6, 7],
'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'],
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
})
# ✅ Filter rows where 'SensorValue' can be converted to numeric
df_numeric_rows = df[pd.to_numeric(df['SensorValue'], errors='coerce').notnull()]
print("DataFrame with non-numeric rows in 'SensorValue' removed:")
print(df_numeric_rows)
Output:
DataFrame with non-numeric rows in 'SensorValue' removed:
RecordID SensorValue Category
0 1 10.5 A
2 3 12.3 A
3 4 -5.0 C
4 5 15 B
This method keeps rows where 'SensorValue' was originally a number or a string representing a number. Existing NaN values pass through pd.to_numeric unchanged, and notnull() then filters them out, just like the strings "Error" and "N/A" that become NaN during coercion.
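One follow-up worth noting (a sketch, not part of the original walkthrough): after filtering, the kept 'SensorValue' column is still object dtype because it previously held mixed values, and the index retains gaps (0, 2, 3, 4). Assigning the already-coerced values back and resetting the index fixes both:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'RecordID': [1, 2, 3, 4, 5, 6, 7],
    'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'],
})

# Coerce once, then reuse the result for both the filter and the replacement values
numeric = pd.to_numeric(df['SensorValue'], errors='coerce')
df_clean = df[numeric.notnull()].copy()

# The filtered column is still object dtype; assign the coerced
# values so downstream math runs on float64
df_clean['SensorValue'] = numeric[numeric.notnull()]

# Optional: the index still shows gaps (0, 2, 3, 4); renumber it
df_clean = df_clean.reset_index(drop=True)

print(df_clean)
print(df_clean['SensorValue'].dtype)  # float64
```

Using .copy() avoids the SettingWithCopyWarning that can appear when assigning into a slice of the original DataFrame.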
Method 2: Using Series.apply() with isinstance()
You can apply a custom function to each element in the column to check if its type is one of the numeric types.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'RecordID': [1, 2, 3, 4, 5, 6, 7],
'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'], # '15' is string
'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
})
# Define numeric types to check against (Python's and NumPy's)
numeric_types = (int, float, np.integer, np.floating)
# np.number can be used instead to cover all NumPy numeric scalar types.
# ✅ Create a boolean mask by checking the type of each element
is_instance_numeric_mask = df['SensorValue'].apply(lambda x: isinstance(x, numeric_types))
print("Boolean mask from apply(isinstance):")
print(is_instance_numeric_mask)
print()
df_numeric_apply = df[is_instance_numeric_mask]
print("DataFrame filtered using apply() and isinstance():")
print(df_numeric_apply)
Output:
Boolean mask from apply(isinstance):
0 True
1 False
2 True
3 True
4 False
5 True
6 False
Name: SensorValue, dtype: bool
DataFrame filtered using apply() and isinstance():
RecordID SensorValue Category
0 1 10.5 A
2 3 12.3 A
3 4 -5.0 C
5 6 NaN A
- This method checks the current type of each element. It won't identify strings like "15" as numeric unless they are converted first.
- np.nan is a float, so isinstance(np.nan, float) is True. If you want to exclude NaNs as well, add a pd.notna(x) check in the lambda.
- The pd.to_numeric approach (Method 1) is generally more robust for identifying values that can be interpreted as numbers, even when stored as strings.
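The NaN-exclusion point above can be sketched as follows, reusing the same example DataFrame; the pd.notna(x) clause is the only addition:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'RecordID': [1, 2, 3, 4, 5, 6, 7],
    'SensorValue': [10.5, 'Error', 12.3, -5.0, '15', np.nan, 'N/A'],
})

# isinstance alone keeps np.nan (it is a float); pd.notna(x) drops it
mask = df['SensorValue'].apply(
    lambda x: isinstance(x, (int, float, np.number)) and pd.notna(x)
)
df_strict = df[mask]
print(df_strict)  # rows 0, 2, 3 only; '15' and np.nan are both excluded
```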
Method 3: Using Series.str.isnumeric() (For Integer Strings Only)
If you expect your "numeric" values to be non-negative integer strings (e.g., "123", "0", "9999"), and want to filter out anything else (including floats like "3.14", negative numbers like "-10", or non-digit strings), you can use Series.str.isnumeric(). This requires converting the column to string type first if it isn't already.
import pandas as pd
df_isnumeric_check = pd.DataFrame({
'RecordID': [1, 2, 3, 4, 5],
'SensorValue': ['10', 'Error', '12.3', '-5', '150'], # Note: '12.3' and '-5' are not isnumeric()
'Category': ['A', 'B', 'A', 'C', 'B']
})
# Convert the column to string type first to use .str accessor methods.
# astype(str) turns NaN into the literal string 'nan', which isnumeric()
# rejects, so no separate NaN handling is needed here.
is_strictly_numeric_string_mask = df_isnumeric_check['SensorValue'].astype(str).str.isnumeric()
print("Boolean mask from .str.isnumeric():")
print(is_strictly_numeric_string_mask)
print()
df_filtered_isnumeric = df_isnumeric_check[is_strictly_numeric_string_mask]
print("DataFrame filtered using .str.isnumeric():")
print(df_filtered_isnumeric)
Output:
Boolean mask from .str.isnumeric():
0 True
1 False
2 False
3 False
4 True
Name: SensorValue, dtype: bool
DataFrame filtered using .str.isnumeric():
RecordID SensorValue Category
0 1 10 A
4 5 150 B
- str.isnumeric(): Returns True if all characters in the string are numeric and there is at least one character. It returns False for floats (because of the "."), negative numbers (because of the "-"), and empty strings.
- This method is very restrictive and usually only suitable if you specifically want to identify non-negative integer strings.
Important Considerations (e.g., Floats, Negative Numbers)
- pd.to_numeric(..., errors='coerce'): Most flexible. Handles integers, floats, and strings representing numbers; converts anything unconvertible to NaN.
- isinstance(x, (int, float, np.number)): Checks the current type only. Doesn't treat strings like "123" as numeric.
- str.isnumeric(): Only for non-negative integer strings. str.isdigit() is similar; str.isdecimal() is even more restrictive (only Unicode decimal digits).
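The differences between these three string checks are easiest to see on a few sample values. The plain Python string methods behave the same as the .str accessor versions; the Unicode samples '²' and '½' are illustrative additions:

```python
samples = ['123', '12.3', '-5', '', '²', '½']

# Compare the three checks side by side for each sample string
for s in samples:
    print(repr(s), s.isnumeric(), s.isdigit(), s.isdecimal())

# '123'  -> True,  True,  True   (plain ASCII digits pass all three)
# '12.3' -> False, False, False  (the '.' fails every check)
# '-5'   -> False, False, False  (the '-' fails every check)
# ''     -> False, False, False  (empty strings always fail)
# '²'    -> True,  True,  False  (superscript digit: not decimal)
# '½'    -> True,  False, False  (vulgar fraction: numeric only)
```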
Conclusion
To remove rows from a Pandas DataFrame where a specific column contains non-numeric values:
- The most robust and generally recommended method is pd.to_numeric(df['your_column'], errors='coerce').notnull(). This attempts to convert values to a numeric type, turns unconvertible values into NaN, and then lets you filter out those NaNs (which now mark the original non-numeric entries): df_cleaned = df[pd.to_numeric(df['your_column'], errors='coerce').notnull()]
- df['your_column'].apply(lambda x: isinstance(x, (int, float, np.number))) checks existing types but won't recognize numeric strings.
- df['your_column'].astype(str).str.isnumeric() is only suitable if you specifically want non-negative integer strings, excluding floats, negatives, and other non-digit characters.
Always inspect your data and choose the method that correctly identifies what you consider "non-numeric" for your specific use case. pd.to_numeric with errors='coerce' is often the best starting point.