
How to Split Strings on Spaces and Punctuation in Python

Splitting strings on spaces, punctuation, or a combination of the two is a frequent task in text processing. This guide explores several ways to do it in Python, covering:

  • Splitting by one or more spaces or a delimiter
  • Regular Expressions: re.split() and re.findall()
  • Using str.replace() and str.strip() for text cleaning
  • Creating lists of words with or without punctuation characters
  • Splitting on punctuation marks, or keeping punctuation as separate list items

Splitting by One or More Spaces with str.split()

The most basic way to split a string by any amount of whitespace is to call the str.split() method without any arguments:

my_str = 'a      b \nc d   \r\ne'
my_list = my_str.split()
print(my_list) # Output: ['a', 'b', 'c', 'd', 'e']
  • Called with no separator, the split() method treats any run of consecutive whitespace (spaces, tabs, newlines) as a single delimiter.

It also ignores whitespace at the beginning and end of the string, so the result never contains empty strings:

my_str = '  alice   bob carl   diana     '
my_list = my_str.split()
print(my_list) # Output: ['alice', 'bob', 'carl', 'diana']
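
To see why the no-argument form matters, the following sketch compares it with an explicit single-space separator on the same kind of input. The variable names (text, no_arg, explicit) are illustrative only.

```python
text = "  alice   bob carl  "

# No argument: runs of whitespace collapse and the edges are ignored.
no_arg = text.split()

# Explicit single space: every space is its own delimiter,
# so each extra space produces an empty string.
explicit = text.split(" ")

print(no_arg)    # ['alice', 'bob', 'carl']
print(explicit)  # ['', '', 'alice', '', '', 'bob', 'carl', '', '']
```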

Splitting by Whitespace with re.split()

The re.split() function splits a string on a regular expression, which gives you more control than str.split(). To split on whitespace, use the pattern \s+:

import re
my_str = 'a b \nc d \r\ne'
my_list = re.split(r'\s+', my_str)
print(my_list) # Output: ['a', 'b', 'c', 'd', 'e']
  • \s+ matches one or more whitespace characters.

Unlike str.split(), re.split() does not discard leading and trailing whitespace: if the string starts or ends with whitespace, the result contains an empty string at that end. To drop those empty strings, pass the result through filter():

import re

my_str = ' a b \nc d \r\ne '

my_list = list(filter(None, re.split(r'\s+', my_str)))
print(my_list) # Output: ['a', 'b', 'c', 'd', 'e']
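
As a small illustration of what filter() is removing here, this sketch prints the unfiltered result and shows one possible alternative, calling strip() before splitting. The names raw and stripped are made up for the example.

```python
import re

text = " a b \nc d \r\ne "

# Without filtering, the leading and trailing whitespace each
# leave an empty string at the edge of the list.
raw = re.split(r"\s+", text)

# Stripping the string first avoids the empty strings altogether.
stripped = re.split(r"\s+", text.strip())

print(raw)       # ['', 'a', 'b', 'c', 'd', 'e', '']
print(stripped)  # ['a', 'b', 'c', 'd', 'e']
```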

Splitting Only on the First Space

To split a string only at the first occurrence of whitespace, set the maxsplit parameter to 1:

my_str = 'one two three four'
my_list = my_str.split(' ', 1)
print(my_list) # Output: ['one', 'two three four']

# With no separator, split on the first run of any whitespace
my_list_2 = my_str.split(maxsplit=1)
print(my_list_2) # Output: ['one', 'two three four']
  • split(' ', 1) splits only at the first space character and leaves the rest of the string intact.
  • maxsplit works the same way whether you pass a separator explicitly or omit it and let split() use runs of whitespace.
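
A common use of maxsplit=1 is unpacking a line into a first word and a remainder. The sketch below assumes an invented sample line; it also shows two related built-ins, rsplit() and partition(), which are not covered above but may be handy when splitting from the right or when the separator might be missing.

```python
line = "GET /index.html HTTP/1.1"

# maxsplit=1 yields at most two items, so unpacking works
# as long as the line contains at least one whitespace run.
command, rest = line.split(maxsplit=1)

# rsplit() counts splits from the right-hand side instead.
head, last = line.rsplit(maxsplit=1)

# partition() always returns three items, even when the
# separator is absent, so unpacking never fails.
before, sep, after = "single".partition(" ")

print(command, "|", rest)    # GET | /index.html HTTP/1.1
print(head, "|", last)       # GET /index.html | HTTP/1.1
print(repr(before), repr(sep), repr(after))  # 'single' '' ''
```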

Splitting a String into a List of Words with re.findall()

The re.findall() function can also build a list of words. The pattern below uses \w, which matches Unicode letters, digits, and the underscore, so each run of those characters is returned and spaces and punctuation are skipped:

import re

my_str = 'one two, three four. five'
my_list = re.findall(r'[\w]+', my_str)
print(my_list) # Output: ['one', 'two', 'three', 'four', 'five']
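
One consequence worth knowing is that \w does not match apostrophes, so contractions are cut in two. The sketch below, using an invented sample sentence, shows that behaviour, the handling of a non-ASCII letter, and one possible fix of adding the apostrophe to the character class.

```python
import re

# "\u00eb" is the letter e with a diaeresis, written as an escape
# so the example does not depend on the file's encoding.
text = "Zo\u00eb can't stop_now 42 times"

# \w+ keeps letters (including non-ASCII ones), digits and
# underscores together, but splits at the apostrophe.
words = re.findall(r"\w+", text)

# Adding ' to the class keeps contractions in one piece.
with_apostrophes = re.findall(r"[\w']+", text)

print(words)
print(with_apostrophes)
```

The first list contains 'can' and 't' as separate items, while the second contains "can't".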

Splitting a String into a List of Words using str.replace() and split()

To clean a string before splitting it into a list of words, you can remove specific characters such as punctuation marks, and then apply the split() method.

my_str = 'one two, three four. five'

my_list = my_str.replace(',', '').replace('.', '').split()
print(my_list) # Output: ['one', 'two', 'three', 'four', 'five']
  • The chained replace() calls remove the punctuation characters you don't want.
  • The split() method then splits the cleaned string into a list of words.

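To strip every ASCII punctuation character rather than one or two specific ones, use string.punctuation from the string module. Note that strip() only removes characters from the start and end of each word, so punctuation inside a word (such as the apostrophe in don't) is kept: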

import string
my_str = 'one two, three four. five'

my_list = [word.strip(string.punctuation) for word in my_str.split()]
print(my_list) # Output: ['one', 'two', 'three', 'four', 'five']
  • The .strip(string.punctuation) part removes any leading or trailing punctuation characters from each word in the list.
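
If punctuation inside words should go as well, one option is str.translate() with a deletion table built by str.maketrans(). This is a sketch with an invented sample string; note that it only covers the ASCII characters in string.punctuation.

```python
import string

text = "one two, three-four. don't!"

# strip() only trims the ends of each word, so the hyphen and
# the apostrophe inside the words survive.
trimmed = [word.strip(string.punctuation) for word in text.split()]

# The third argument of maketrans() lists characters to delete.
table = str.maketrans("", "", string.punctuation)
removed = text.translate(table).split()

print(trimmed)  # ['one', 'two', 'three-four', "don't"]
print(removed)  # ['one', 'two', 'threefour', 'dont']
```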

Splitting a String on Punctuation Marks

To split on punctuation marks as well as spaces, pass re.split() a character class that lists the delimiter characters:

import re

my_str = """One, Two Three. Four! Five? I'm!"""
my_list = re.split(r'[ ,.!?]', my_str) # Matches a single space or one of , . ! ?
print(my_list)
# Output: ['One', '', 'Two', 'Three', '', 'Four', '', 'Five', '', "I'm", '']

my_list = list(filter(None, re.split(r'[ ,.!?]', my_str)))
print(my_list)
# Output: ['One', 'Two', 'Three', 'Four', 'Five', "I'm"]
  • Each character in the class [ ,.!?] is a delimiter on its own, so adjacent delimiters such as a comma followed by a space produce empty strings, which filter(None, ...) removes.
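
A variation on the pattern above is to add the + quantifier so that a run of delimiters counts as one split point. In this sketch the class also uses \s instead of a literal space, which is an assumption about wanting tabs and newlines treated as delimiters too.

```python
import re

text = "One, Two Three. Four! Five? I'm!"

# "+" treats a run such as ", " as a single delimiter,
# so no empty strings appear between words.
parts = re.split(r"[\s,.!?]+", text)

# The string ends with a delimiter, which still leaves one
# empty string at the end; drop it.
words = [part for part in parts if part]

print(parts)  # ['One', 'Two', 'Three', 'Four', 'Five', "I'm", '']
print(words)  # ['One', 'Two', 'Three', 'Four', 'Five', "I'm"]
```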

Splitting a String into Words and Punctuation

In some cases you might not want to remove the punctuation.

You can use re.findall() to return a list of all matches in the string:

import re

my_str = """One, "Two" Three. Four! Five? I'm """
result = re.findall(r"[\w'\"]+|[,.!?]", my_str)

print(result)
# Output: ['One', ',', '"Two"', 'Three', '.', 'Four', '!', 'Five', '?', "I'm"]
  • The pattern r"[\w'\"]+|[,.!?]" has two alternatives, so findall() returns both words and punctuation marks, in the order they appear.
  • [\w'\"]+ matches a run of word characters (letters, digits, underscore), apostrophes, and double quotes, which keeps I'm and "Two" intact.
  • | is the regex alternation operator (or).
  • [,.!?] matches a single punctuation mark; list in the brackets exactly the characters you want returned as separate items.
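
When the set of punctuation marks is not known in advance, a more general pattern can be used. The sketch below is one possible approach, not the only one: it keeps contractions together and returns every other non-space symbol, including quotes, as its own item. The sample string and the name tokens are illustrative.

```python
import re

text = """One, "Two" Three. I'm fine!"""

# \w+(?:'\w+)*  a word, optionally followed by 'suffix parts (I'm)
# [^\w\s]       any single character that is neither a word
#               character nor whitespace
tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(tokens)
# ['One', ',', '"', 'Two', '"', 'Three', '.', "I'm", 'fine', '!']
```

Unlike the pattern above, this one separates the double quotes from the word they surround.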