Kodeclik Blog
How to remove non-alphabetic characters from strings in Python
There will come a time when you are processing strings in your Python program and a string contains some “nuisance” characters that need to be removed. This blog post shows how to remove non-alphabetic characters from a Python string.
We explore three different methods to accomplish our objective!
Method 1: Use the isalpha() Method
This approach uses Python's built-in string method isalpha() in combination with a generator expression and join().
inputstring = "K0odecl1ik!"
only_letters = ''.join(char for char in inputstring if char.isalpha())
print(only_letters)
In the above program, the isalpha() method checks each character to verify if it's an alphabetic letter. The generator expression yields only the characters that pass this check, and join() combines them back into a string. This method is straightforward and easy to understand, making it ideal for simple text cleaning tasks.
The output is:
Kodeclik
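One detail worth knowing is that isalpha() is Unicode-aware, so accented letters count as alphabetic. The sketch below illustrates the difference; the helper name keep_letters and its ascii_only flag are made up for this example and are not part of the original program.

```python
def keep_letters(text, ascii_only=False):
    """Return text with every non-letter removed (hypothetical helper)."""
    if ascii_only:
        # isascii() narrows the check to the letters a-z and A-Z.
        return ''.join(ch for ch in text if ch.isascii() and ch.isalpha())
    # isalpha() on its own also accepts letters such as 'é'.
    return ''.join(ch for ch in text if ch.isalpha())

sample = "Caf\u00e9 24/7!"  # "Café 24/7!"
print(keep_letters(sample))                   # Café
print(keep_letters(sample, ascii_only=True))  # Caf
```

If your data is plain English text, both variants behave the same way.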
Method 2: Use Regular Expressions
This solution employs regular expressions through the re module.
import re
inputstring = "K0odecl1ik!"
clean_text = re.sub(r'[^a-zA-Z]', '', inputstring)
print(clean_text)
Here, the re.sub() function replaces all characters that don't match the pattern [^a-zA-Z] with an empty string. The caret ^ inside the square brackets means "not", so this pattern matches any character that is not a letter from a to z or A to Z.
The output is:
Kodeclik
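Regular expressions offer more flexibility when the rule for what to keep is more complex than "letters only": changing the character class is enough to keep digits or spaces as well. Note that, unlike isalpha(), this pattern treats only the ASCII letters a to z and A to Z as alphabetic, so accented letters such as é are removed.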
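If the same cleanup is applied to many strings, one option is to compile the pattern once and reuse it. This is a minimal sketch; the names NON_LETTERS and strip_non_letters are chosen for illustration only.

```python
import re

# Compile once; the trailing + removes each run of non-letters in one substitution.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def strip_non_letters(text):
    """Delete every character that is not an ASCII letter."""
    return NON_LETTERS.sub('', text)

samples = ["K0odecl1ik!", "Py_th-on 3.12", "1234"]
cleaned = [strip_non_letters(s) for s in samples]
print(cleaned)  # ['Kodeclik', 'Python', '']
```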
Method 3: Use the translate() Method
The translate() method provides a highly efficient way to remove multiple characters at once.
import string
inputstring = "K0odecl1ik!"
translator = str.maketrans('', '', string.punctuation + string.digits + string.whitespace)
clean_text = inputstring.translate(translator)
print(clean_text)
This program creates a translation “table” using str.maketrans(), whose third argument lists the characters to delete: every punctuation, digit, and whitespace character is mapped to None. The string.punctuation, string.digits, and string.whitespace constants from the string module cover the ASCII characters in each category, so anything outside those lists (such as a non-ASCII symbol) is left in place. translate() is typically fast on large strings because the per-character lookup happens inside the interpreter rather than in a Python-level loop.
The output is:
Kodeclik
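Because translate() removes only the characters it is told about, symbols that are not in the string module's ASCII constants survive. The short comparison below is a sketch with a made-up sample string that shows the difference from the isalpha() approach.

```python
import string

# Only ASCII punctuation, digits, and whitespace are listed for deletion.
ascii_junk = string.punctuation + string.digits + string.whitespace
translator = str.maketrans('', '', ascii_junk)

sample = "Total: 10\u20ac \u2013 50% off"  # contains a euro sign and an en dash

via_translate = sample.translate(translator)
via_isalpha = ''.join(ch for ch in sample if ch.isalpha())

print(via_translate)  # Total€–off
print(via_isalpha)    # Totaloff
```

If your input may contain such symbols, the isalpha() filter is the safer of the two.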
Here are some applications of what you have learnt so far! In the examples below we slightly update the code so that we retain spaces for readability, but you can change this if you so desire.
Cleaning User Comments
Here we aim to sanitize user comments that might contain non-alphabetic characters:
user_comment = "Hey!!! This product is AMAZING!!! <3 :) Would buy again... 100% satisfied!!!"
clean_text = ''.join(char for char in user_comment if char.isalpha() or char.isspace())
print(clean_text)
# Output: Hey This product is AMAZING   Would buy again  satisfied
This method uses a generator expression with two conditions. The char.isalpha() checks if a character is a letter, while char.isspace() checks if it's a whitespace character. The join() method then combines all retained characters. Note that the spaces around a removed token such as "<3" are kept, so the output contains runs of spaces where those tokens used to be. This approach is particularly useful for cleaning social media comments or user reviews where emoticons, excessive punctuation, and numbers are common but need to be removed while maintaining readability.
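Deleting characters leaves their neighbouring spaces behind, so the cleaned comment can contain runs of spaces. One common way to tidy this up, shown here as a small sketch, is to split on whitespace and re-join with single spaces.

```python
user_comment = "Hey!!! This product is AMAZING!!! <3 :) Would buy again... 100% satisfied!!!"

# Step 1: keep letters and whitespace only.
kept = ''.join(ch for ch in user_comment if ch.isalpha() or ch.isspace())

# Step 2: split() with no arguments drops empty pieces, so runs of
# whitespace collapse into a single space when the words are re-joined.
tidy = ' '.join(kept.split())
print(tidy)  # Hey This product is AMAZING Would buy again satisfied
```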
Cleaning Product Descriptions
Here we are standardizing product descriptions that might contain a range of non-alphabetic characters:
import re
product_desc = "$Special-Edition* Nike Air-Max 2024 (Limited Release) @ $299.99!!!"
clean_text = re.sub(r'[^a-zA-Z\s]', '', product_desc)
print(clean_text)
# Output: SpecialEdition Nike AirMax  Limited Release
This solution uses regular expressions with the pattern [^a-zA-Z\s]. The ^ inside square brackets means "not", a-zA-Z represents all letters, and \s represents whitespace characters. The re.sub() function replaces all characters that don't match this pattern with an empty string. Note that hyphens are deleted rather than replaced, so "Special-Edition" becomes "SpecialEdition", and the spaces around removed tokens such as "2024" remain, leaving a double space in the middle and some trailing whitespace. This method is ideal for cleaning product descriptions or titles that often contain special characters, model numbers, and prices.
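If hyphenated words should stay separate, a variation is to replace each run of non-letters with a single space instead of deleting it. This is a sketch of that idea, not the only way to do it.

```python
import re

product_desc = "$Special-Edition* Nike Air-Max 2024 (Limited Release) @ $299.99!!!"

# Every run of non-letters (spaces included) becomes one space;
# strip() then removes the space left at either end.
spaced = re.sub(r'[^a-zA-Z]+', ' ', product_desc).strip()
print(spaced)  # Special Edition Nike Air Max Limited Release
```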
Cleaning Email Subjects
Here we demonstrate how to clean email subject lines, which can contain a lot of extraneous information:
import string
email_subject = "RE: [URGENT!!!] Your Order #12345 Status Update - Shipped!"
translator = str.maketrans('', '', string.punctuation + string.digits)
clean_text = email_subject.translate(translator)
print(clean_text)
# Output: RE URGENT Your Order  Status Update  Shipped
This approach uses the translate() method with a custom translation table. The table is built from the punctuation and digit constants in the string module; string.whitespace is deliberately left out, so spaces are kept. (Neither constant contains a space, so no extra filtering is needed.) The translate() method then removes all listed characters in a single call. Because the spaces around a removed token are kept, the output has double spaces where "#12345" and "-" used to be. This method is particularly effective for email subjects or headers that often contain various prefixes, brackets, and reference numbers.
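translate() can also map characters to a replacement instead of deleting them. The sketch below maps each unwanted character to a space and then collapses the resulting runs; the variable names are illustrative.

```python
import string

email_subject = "RE: [URGENT!!!] Your Order #12345 Status Update - Shipped!"

unwanted = string.punctuation + string.digits
# Two-argument maketrans: the i-th character of the first string is
# mapped to the i-th character of the second (here, always a space).
to_space = str.maketrans(unwanted, ' ' * len(unwanted))

tidy = ' '.join(email_subject.translate(to_space).split())
print(tidy)  # RE URGENT Your Order Status Update Shipped
```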
So we have learnt three different methods to "clean up" your string. Which method is your favorite?
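To help pick a favorite, here is a rough sketch that checks the three methods agree on ASCII input and times them with timeit. The function names and the test string are assumptions made for this example, and the timings will vary by machine and Python version.

```python
import re
import string
import timeit

_PATTERN = re.compile(r'[^a-zA-Z]+')
_TABLE = str.maketrans('', '', string.punctuation + string.digits + string.whitespace)

def with_isalpha(text):
    return ''.join(ch for ch in text if ch.isalpha())

def with_regex(text):
    return _PATTERN.sub('', text)

def with_translate(text):
    return text.translate(_TABLE)

methods = (with_isalpha, with_regex, with_translate)
sample = "K0odecl1ik! " * 1000  # ASCII-only test data (an assumption)

# On ASCII-only input the three methods should produce the same string.
outputs = {m.__name__: m(sample) for m in methods}
assert len(set(outputs.values())) == 1

for m in methods:
    seconds = timeit.timeit(lambda: m(sample), number=200)
    print(f"{m.__name__}: {seconds:.4f}s")
```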