How to check that a string contains only letters, numbers, underscores and dashes?

I know how to do this by iterating over all the characters in the string, but I'm looking for a more elegant method.

+78
python string regex
Sep 18 '08 at 4:04
11 answers

A regular expression will do the trick with very little code:

    import re
    ...
    if re.match("^[A-Za-z0-9_-]*$", my_little_string):
        # do something here
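
One caveat with the pattern above, sketched here for Python 3: `$` also matches just before a trailing newline, so "abc\n" is accepted. Using re.fullmatch (Python 3.4+) closes that gap. The names VALID and is_valid are illustrative and not part of the answer, and treating the empty string as valid is an assumption carried over from the `*` quantifier.

```python
import re

# Same character class as in the answer, compiled once.
VALID = re.compile(r"[A-Za-z0-9_-]*")

def is_valid(s):
    """Return True if s consists only of letters, digits, '_' and '-'."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return VALID.fullmatch(s) is not None

print(is_valid("some_valid-name42"))  # True
print(is_valid("not valid!"))         # False
print(is_valid("abc\n"))              # False
# The anchored re.match version lets the trailing newline through:
print(bool(re.match(r"^[A-Za-z0-9_-]*$", "abc\n")))  # True
```

If an empty string should be rejected, changing `*` to `+` in the pattern is enough.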
+111
Sep 18 '08 at 4:08

[Edit] There is another solution that has not been mentioned yet, and in most cases it seems to outperform the others given so far.

Use string.translate to remove all valid characters from the string, and see whether any invalid characters are left over. This is pretty fast, as it uses the underlying C function to do the work, with very little Python bytecode involved.
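
The two-argument translate(table, deletechars) call described here is the Python 2 form. A rough Python 3 equivalent, as a sketch rather than the answer's own code, builds a deletion table with the three-argument str.maketrans. The names ALLOWED_CHARS, DELETE_ALLOWED and check_trans_py3 are hypothetical.

```python
import string

ALLOWED_CHARS = string.ascii_letters + string.digits + "_-"

# The three-argument form of str.maketrans builds a table that deletes
# every character listed in the third argument.
DELETE_ALLOWED = str.maketrans("", "", ALLOWED_CHARS)

def check_trans_py3(s):
    """Return True if nothing is left after deleting the allowed characters."""
    return not s.translate(DELETE_ALLOWED)

print(check_trans_py3("short_valid-string"))  # True
print(check_trans_py3("/$%$%&"))              # False
print(check_trans_py3(""))                    # True
```

The timings reported in this answer were measured with the Python 2 version, so they should not be assumed to carry over to this variant.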

Obviously, performance isn't everything: going for the most readable solution is probably the best approach when the code is not performance-critical. But just to see how the solutions stack up, here is a performance comparison of all the methods proposed so far. check_trans is the one that uses the string.translate method.

Test code:

    import string, re, timeit

    pat = re.compile('[\w-]*$')
    pat_inv = re.compile ('[^\w-]')
    allowed_chars = string.ascii_letters + string.digits + '_-'
    allowed_set = set(allowed_chars)
    trans_table = string.maketrans('', '')

    def check_set_diff(s):
        return not set(s) - allowed_set

    def check_set_all(s):
        return all(x in allowed_set for x in s)

    def check_set_subset(s):
        return set(s).issubset(allowed_set)

    def check_re_match(s):
        return pat.match(s)

    def check_re_inverse(s):
        # Search for non-matching character.
        return not pat_inv.search(s)

    def check_trans(s):
        return not s.translate(trans_table, allowed_chars)

    test_long_almost_valid = 'a_very_long_string_that_is_mostly_valid_except_for_last_char' * 99 + '!'
    test_long_valid = 'a_very_long_string_that_is_completely_valid_' * 99
    test_short_valid = 'short_valid_string'
    test_short_invalid = '/$%$%&'
    test_long_invalid = '/$%$%&' * 99
    test_empty = ''

    def main():
        funcs = sorted(f for f in globals() if f.startswith('check_'))
        tests = sorted(f for f in globals() if f.startswith('test_'))
        for test in tests:
            print "Test %-15s (length = %d):" % (test, len(globals()[test]))
            for func in funcs:
                print "  %-20s : %.3f" % (
                    func,
                    timeit.Timer(
                        '%s(%s)' % (func, test),
                        'from __main__ import pat,allowed_set,%s' % ','.join(funcs+tests)
                    ).timeit(10000))
            print

    if __name__ == '__main__':
        main()

My system results:

    Test test_empty (length = 0):
      check_re_inverse     : 0.042
      check_re_match       : 0.030
      check_set_all        : 0.027
      check_set_diff       : 0.029
      check_set_subset     : 0.029
      check_trans          : 0.014

    Test test_long_almost_valid (length = 5941):
      check_re_inverse     : 2.690
      check_re_match       : 3.037
      check_set_all        : 18.860
      check_set_diff       : 2.905
      check_set_subset     : 2.903
      check_trans          : 0.182

    Test test_long_invalid (length = 594):
      check_re_inverse     : 0.017
      check_re_match       : 0.015
      check_set_all        : 0.044
      check_set_diff       : 0.311
      check_set_subset     : 0.308
      check_trans          : 0.034

    Test test_long_valid (length = 4356):
      check_re_inverse     : 1.890
      check_re_match       : 1.010
      check_set_all        : 14.411
      check_set_diff       : 2.101
      check_set_subset     : 2.333
      check_trans          : 0.140

    Test test_short_invalid (length = 6):
      check_re_inverse     : 0.017
      check_re_match       : 0.019
      check_set_all        : 0.044
      check_set_diff       : 0.032
      check_set_subset     : 0.037
      check_trans          : 0.015

    Test test_short_valid (length = 18):
      check_re_inverse     : 0.125
      check_re_match       : 0.066
      check_set_all        : 0.104
      check_set_diff       : 0.051
      check_set_subset     : 0.046
      check_trans          : 0.017

In most cases the translate approach looks best, dramatically so with long valid strings, but it is beaten by the regexes in test_long_invalid (presumably because the regex can bail out immediately, whereas translate always has to scan the whole string). The set approaches are usually the worst, beating the regexes only on very short valid input such as the empty string.

Using all(x in allowed_set for x in s) performs well if it bails out early, but can be bad if it has to go through every character. issubset and set difference are comparable, and are consistently proportional to the length of the string regardless of the data.

There is a similar difference between the regex method that matches all valid characters and the one that searches for invalid characters. Matching performs a little better when checking a long but fully valid string, but worse when the invalid characters are at the end of the string.

+22
Sep 18 '08 at 12:19

There are many ways to achieve this, some of which are clearer than others. For each of my examples, “True” means that the string passed is valid, “False” means that it contains invalid characters.

First of all, there is a naive approach:

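    import string

    # string.ascii_letters is used instead of the locale-dependent
    # string.letters, which no longer exists in Python 3.
    allowed = string.ascii_letters + string.digits + '_' + '-'

    def check_naive(mystring):
        return all(c in allowed for c in mystring)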

Then there is the regular expression approach, which you can do with re.match(). Note that '-' has to be at the end of the [] (or at the start, or escaped), otherwise it will be used as a range delimiter. Also note the $, which means "end of string". Other answers to this question use the special character class "\w". I always prefer an explicit character class range using [], because it is easier to understand without having to look up a quick reference guide, and easier to special-case.

    import re

    CHECK_RE = re.compile('[a-zA-Z0-9_-]+$')

    def check_re(mystring):
        return CHECK_RE.match(mystring)
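
To illustrate the point about the position of '-', here is a small sketch with two deliberately tiny character classes (they are illustrative only, not the pattern from this answer). With the hyphen last it is a literal; between two characters it defines a range.

```python
import re

# '-' last in the class: a literal hyphen.
literal_dash = re.compile(r"[a_-]")

# '-' between two characters: the range from '_' (0x5F) to 'a' (0x61),
# which also contains the backtick (0x60) but not the hyphen (0x2D).
range_dash = re.compile(r"[_-a]")

print(bool(literal_dash.fullmatch("-")))  # True
print(bool(range_dash.fullmatch("-")))    # False
print(bool(range_dash.fullmatch("`")))    # True
```

A misplaced hyphen can therefore both reject '-' and silently accept characters that were never meant to be allowed.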

Another answer noted that you can do an inverse match with regular expressions, so I have included that here. Note that the leading ^ in [^...] inverts the character class:

    CHECK_INV_RE = re.compile('[^a-zA-Z0-9_-]')

    def check_inv_re(mystring):
        return not CHECK_INV_RE.search(mystring)

You can also do something tricky with the set object. Take a look at this example, which removes all valid characters from the original string, leaving us with a set containing either a) nothing, or b) the offending characters from the string:

    def check_set(mystring):
        return not set(mystring) - set(allowed)
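
A side benefit of the set difference is that it tells you which characters are the offending ones, which can be handy for error messages. A minimal sketch, where the helper name invalid_chars is hypothetical:

```python
import string

ALLOWED = frozenset(string.ascii_letters + string.digits + "_-")

def invalid_chars(s):
    """Return the sorted list of characters in s that are not allowed."""
    return sorted(set(s) - ALLOWED)

print(invalid_chars("good_name-1"))  # []
print(invalid_chars("bad name!"))    # [' ', '!']
```

An empty list means the string is valid, so `not invalid_chars(s)` gives the same answer as check_set.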
+15
Sep 18 '08 at 4:18

If not for dashes and underscores, the simplest solution would be

 my_little_string.isalnum() 

(Section 3.6.1 in the Python Library Reference)
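
If you still want to build on isalnum(), one possible workaround is to strip the two extra characters first. This is only a sketch under a few assumptions: it treats the empty string as valid, it needs Python 3.7+ for str.isascii(), and the name check_isalnum is made up for illustration.

```python
def check_isalnum(s):
    """Strip the two extra allowed characters, then rely on isalnum()."""
    stripped = s.replace("_", "").replace("-", "")
    # isalnum() is False for an empty string, so a string made up only of
    # '_' and '-' (or an empty one) is handled separately.
    if not stripped:
        return True
    # isascii() needs Python 3.7+; it rejects non-ASCII letters and digits
    # that isalnum() alone would accept.
    return stripped.isascii() and stripped.isalnum()

print(check_isalnum("my_little-string1"))  # True
print(check_isalnum("my little string"))   # False
print(check_isalnum("caf\u00e9"))          # False
print(check_isalnum("___"))                # True
```

Dropping the isascii() check would make this accept any Unicode letter or digit, which may or may not be what you want.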

+11
Sep 18 '08 at 10:49

As an alternative to using a regex, you could do it with sets:

    # The built-in set type replaces the old sets.Set class.
    allowed_chars = set('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-')

    if set(my_little_string).issubset(allowed_chars):
        # your action
        print(True)
+5
Sep 18 '08 at 10:47
    import re

    pat = re.compile(r'[^\w-]')

    def onlyallowed(s):
        return not pat.search(s)
+4
Sep 18 '08 at 4:12

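Well, you can ask regex for help here, it works great :)

Code:

    import re

    string = 'adsfg34wrtwe4r2_()'  # your string that needs to be matched
    # Note: \w already covers digits and '_', so \d and _ are redundant here.
    # This class allows ( and ) but not '-'; for the question as asked,
    # use r'^[\w-]*$' instead. You can also add a space if you want to allow it.
    regex = r'^[\w\d_()]*$'

    if re.match(regex, string):
        print('yes')
    else:
        print('false')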

Output:

 yes 

Hope this helps :)

+1
Nov 14 '13 at 6:04

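Regular expressions can be very flexible.

    import re
    re.fullmatch(r"^[\w-]+$", target_string)  # re.fullmatch is available from Python 3.4

\w : matches [a-zA-Z0-9_]. For str patterns in Python 3 it also matches other Unicode word characters unless the re.ASCII flag is set.

So you need to add - to the character class to allow the hyphen.

+ : Matches one or more repetitions of the preceding character class. I assume you don't want to accept empty input. If you do, change it to * .

^ : Matches the beginning of the string (or of each line, with re.MULTILINE).

$ : Matches the end of the string, or just before a trailing newline (and the end of each line, with re.MULTILINE).

You need these two anchors when matching with re.search or re.match, to rule out the following case, where unwanted characters such as & appear around the matching parts. (re.fullmatch already requires the whole string to match, so there they are redundant.)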

&&&PATTERN&&PATTERN
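
The behaviour of \w and the need to match the whole string can both be checked with a short Python 3 sketch. The variable names unicode_word and ascii_word are illustrative only.

```python
import re

unicode_word = re.compile(r"[\w-]+")          # \w is Unicode-aware for str
ascii_word = re.compile(r"[\w-]+", re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(bool(unicode_word.fullmatch("caf\u00e9-1")))  # True
print(bool(ascii_word.fullmatch("caf\u00e9-1")))    # False
print(bool(ascii_word.fullmatch("cafe-1")))         # True

# search finds a valid run inside an invalid string, which is why an
# unanchored pattern is not enough:
print(bool(ascii_word.search("&&&PATTERN&&PATTERN")))     # True
print(bool(ascii_word.fullmatch("&&&PATTERN&&PATTERN")))  # False
```

If only ASCII letters should pass, either set re.ASCII or spell the class out as [a-zA-Z0-9_-].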

+1
Jan 22 '19 at 11:05

You can always use a list comprehension and check the results with all. This might be a little less resource-intensive than using a regex, though that is worth measuring (it needs import string):

    all([c in string.ascii_letters + string.digits + "_-" for c in mystring])

0
Sep 18 '08 at 4:12

Here is something based on Jerub’s “naive approach” (naive are his words, not mine!):

    import string

    ALLOWED = frozenset(string.ascii_letters + string.digits + '_' + '-')

    def check(mystring):
        return all(c in ALLOWED for c in mystring)

If ALLOWED were a string, I think c in ALLOWED would iterate over each character in the string until it found a match or reached the end. Which, to quote Joel Spolsky, is something of a Shlemiel the Painter algorithm.

But testing for membership in a set should be more efficient, or at least less dependent on the number of allowed characters. Certainly this approach is a little faster on my machine. It is clear, and I think it performs well enough for most cases (on my slow machine I can validate tens of thousands of short strings in a fraction of a second). I like it.

ACTUALLY, on my machine a regex works out several times faster, and is just as simple as this (arguably simpler). So that is probably the best way forward.

0
Nov 30 '12 at 16:50

Use a regex and see if it matches!

    ^[a-zA-Z0-9_-]*$
-2
Sep 18 '08 at 4:06


